Data confidentiality and keyword search in the cloud using visual cryptography

Varun Maheshwari

School of Computer Science, McGill University, Montreal, Canada

December 2011

A thesis submitted to McGill University in partial fulfillment of the requirements for the degree of Master of Science.

© 2011 Varun Maheshwari

Dedication

To Mummy and Rohan, for their undying love and support.

Acknowledgements

This thesis would have been impossible without the support and mentoring of my supervisor, Muthucumaru Maheswaran. I was constantly inspired by his expertise on the subject matter, innovative insights and boundless optimism. His timely suggestions and guidance were invaluable to the completion of this work.

A warm thanks to my colleagues at the Advanced Networking Research Laboratory for a wonderful and rewarding time. I received important suggestions and insights in the analysis of my results from Arash Nourian. A special thanks to Yijia Xu for proofreading my thesis. I am also especially thankful to Vince Forgetta for translating the abstract into French. I would also like to extend my appreciation for the facilities provided by the Laboratory, the School of Computer Science and McGill University.

Finally, my heartfelt thanks to my mother and brother for always being there for me, and to whom I owe everything.

Abstract

Security has emerged as the most feared aspect of cloud computing and a major hindrance for customers. Current cloud frameworks do not allow encrypted data to be stored due to the absence of efficient searchable encryption schemes that allow query execution on a cloud database. Storing unencrypted data exposes the data not only to an external attacker but also to the cloud provider itself. Thus, trusting a provider with confidential data is highly risky.

To enable querying on a cloud database without compromising data confidentiality, we propose to use data obfuscation through visual cryptography. A new scheme for visual cryptography is developed and configured for the cloud for storing and retrieving textual data. Testing the system with query execution on a cloud database indicates full accuracy in record retrievals with negligible false positives. In addition, the system is resilient to attacks from within and outside the cloud. Since standard encryption and key management are avoided, our approach is computationally efficient and data confidentiality is maintained.

Résumé

La sécurité a émergé comme l'aspect le plus redouté de l’informatique en nuage et comme un obstacle majeur pour les clients. Le cadre actuel de l’informatique en nuage ne permet pas que les données chiffrées soient stockées en raison de l'absence de schémas efficaces de cryptage qui permettent l'exécution des requêtes sur une base de données des nuages. Le stockage des données non cryptées expose les données non seulement à un agresseur extérieur, mais aussi au fournisseur de nuage lui-même. Ainsi, faire confiance à un fournisseur avec des données confidentielles est très risqué.

Afin de permettre des requêtes sur une base de données des nuages sans compromettre la confidentialité des données, nous proposons d'utiliser l’obscurcissement des données à travers la cryptographie visuelle. Un nouveau schéma pour la cryptographie visuelle est développé et configuré pour le nuage pour stocker et récupérer des données textuelles. Tester le système avec l'exécution des requêtes sur une base de données nuée indique une grande précision dans la récupération des enregistrements avec négligeables faux positifs. En outre, le système est résistant aux attaques de l'intérieur et l'extérieur du nuage. Parce que le cryptage standard et la gestion des clés sont évités, notre approche est mathématiquement efficace et la confidentialité des données est assurée.

Contents

Acknowledgement
Abstract
Contents
List of Figures
List of Tables
1 Introduction
1.1 Motivation
1.2 Thesis contribution
1.3 Thesis organization
2 Related Work
2.1 Encrypted cloud
2.1.1 Data encryption
2.1.2 Searching on encrypted data
2.1.3 Infeasibility of encryption alone for the cloud
2.2 Data obfuscation
2.3 Visual cryptography
3 Data Confidentiality using Visual Cryptography
3.1 Introduction
3.2 Overall concept
3.3 Sending data
3.3.1 Converting text to image
3.3.2 Image obfuscation using noise
3.3.3 Data division across the cloud
3.3.4 Sending algorithm
3.4 Retrieving data
3.4.1 Image retrieval from the records
3.4.2 Matching the image
3.4.3 Retrieval algorithm
3.5 Complexity analysis
3.6 Security Analysis
4 Results
4.1 System design
4.2 Implementation
4.3 Parameter estimation
4.3.1 Noise parameters
4.3.2 Sending and retrieving data
4.3.3 NCC threshold
4.4 Multiple queries on large dataset
4.5 Running time
5 Threat Analysis
5.1 Threat scenarios
5.2 Experimental setup
5.3 Results
5.4 Analysis
6 Conclusions and Future Work
6.1 Conclusions
6.2 Future work
7 References

List of Figures

3.1 Overall system concept
3.2 Image library of ASCII characters
3.3 Normal distribution at different mean and variance
3.4 Gaussian noise
3.5 Speckle noise
3.6 Data carried by each cloud
3.7 Sending data
3.8 Problem with creating the mask
3.9 Creating the correct mask
3.10 Pattern matching with NCC
3.11 Retrieving data
4.1 Mean PSNR of all ASCII characters for Gaussian noise
4.2 Slope of mean PSNR of all ASCII characters for Gaussian noise
4.3 Images at different ( , ) with their PSNR values in dB
4.4 Mean PSNR of all ASCII characters for speckle noise
4.5 Slope of mean PSNR of all ASCII characters for speckle noise
4.6 Optimum NCC threshold for Gaussian noise
4.7 Optimum NCC threshold for speckle noise
4.8 Data retrieval on >5K and >10K character dataset
5.1 Data when one cloud out of four is breached
5.2 Data when two clouds out of four are breached
5.3 False positives and successful search with four clouds
5.4 False positives and successful search with eight clouds

List of Tables

4.1 NCC for Gaussian noise
4.2 NCC for speckle noise
4.3 NCC threshold estimation for Gaussian noise
4.4 NCC threshold estimation for speckle noise
4.5 Data retrieval with multiple search queries on a >5K character dataset
4.6 Data retrieval with multiple search queries on a >10K character dataset
4.7 Running time measurements
5.1 Threat scenarios tested
5.2 Results for different attack scenarios

Chapter 1

Introduction

In this chapter we briefly present the current scenario in the domain of cloud

computing security and the motivation for our thesis. The general organization of

the thesis is given at the end.

1.1 Motivation

The ever increasing demand for large scale computing, combined with advances in

low cost yet fast networking technologies, has helped cloud computing to emerge

as a promising computing model. In contrast to traditional IT services, it

distinguishes itself as a high-performance Internet based technology which is

economically sustainable, both in terms of initial setup and maintenance. With

revenue projected to reach about $150 billion by 2014 [23], cloud

computing has finally arrived on the IT landscape.

A cloud provider offers numerous services; however, they can essentially be

classified into the following categories [8]:

Infrastructure as a Service (IaaS): the consumer employs the

computing, storage and networking infrastructure from the provider

and uses it to deploy and run its own software. The consumer

controls the operating system, storage and applications, but

does not need to manage the underlying infrastructure.

Platform as a Service (PaaS): the consumer can deploy its own software

applications on the provider's infrastructure. As in IaaS, the

consumer does not manage the underlying infrastructure but

has control over the custom applications.

Software as a Service (SaaS): a consumer uses a cloud provider's

computing, networking, storage and individual application capabilities

without having to manage or control the underlying infrastructure.

Applications are accessible from various clients, for example via a web

browser.

Regardless of the service model, a cloud may be deployed in one of

the following ways:

Public cloud: cloud infrastructure is owned and managed by an

organization (cloud service provider) and is located outside the

customer's premises. The data is thus out of the customer's control.

Providers like Amazon, Google and Microsoft come under this

category.

Private cloud: the infrastructure is owned and controlled by the

customer itself and located within the customer's premises. In

contrast to the public cloud, data is under the customer's control and

hence access is only given to trusted parties.

Hybrid cloud: two or more clouds, combining public and private ones,

make a hybrid cloud, where specific parts of the infrastructure and

applications lie in public or private clouds.

While the benefits of using the cloud are clear and understood, some

problems remain unsolved. The biggest hurdle to existing cloud services is

security. The public cloud provider acting as a custodian of a customer's

confidential data has become a major concern for organizations planning a shift to

the cloud. Numerous studies, for instance the IDC Cloud Services User Survey,

identify security as the primary concern for about three-fourths of IT

executives/CIOs.

The issue of security in cloud computing has gained attention in

academia [4], [13], industry [33] and government [38]. Potential customers,

especially enterprises and government organizations, which hold a lot of critical

data such as financial or medical records, are unwilling to trade privacy for

the performance that the cloud promises. Major security loopholes have been

exposed not only in the early days of cloud computing with Google Docs [46], but

also recently with Amazon [18].

A good cloud provider must provide the following:

Data confidentiality: the cloud provider should not learn anything

from the customer data

Integrity: if a provider modifies the customer data, all such

unauthorized modifications must be detectable by the customer

Availability: data should be accessible by the customer from any

machine and at any time

Also, the following are desirable in conjunction with the above:

Reliability: customer data should be backed up

Efficient retrieval: data should be retrievable efficiently

Sharing: customer should be able to share data with parties they

trust

Another concern is the geographical location of the data, as discussed in

[31], since there are currently no internationally agreed rules on data protection

and privacy. Depending on where the cloud provider has stored the data, the host

country may exercise its right to investigate a customer's data. For instance, a

Canadian customer might be concerned about using SaaS in the United States

given the USA Patriot Act [43].

One solution to all of these is to use an encryption/decryption scheme for data

storage and retrieval. However, this approach is highly inefficient and raises other

difficulties in its implementation. More importantly, it hinders the efficient

implementation of a database system on the cloud which can process queries while

maintaining data confidentiality and other attributes. This renders standard

cryptographic methods impractical on the cloud, which is expected to be economical

yet deliver high performance. Hence, a new method for efficient data retrieval

from the cloud without compromising data confidentiality is required.

1.2 Thesis contribution

The main contribution of this thesis is a novel procedure to send data to and retrieve data

from a cloud using database-style queries without using standard

cryptography schemes, thus offering efficient retrievals while maintaining data

confidentiality.

We propose to use data obfuscation instead of an encryption/decryption

scheme to achieve data confidentiality. In this work, we have come up with a

novel procedure for visual cryptography, which we will use to conduct obfuscation.

We show that using our procedure, information cannot be understood by the

cloud and is only decipherable by the user. Further, we show our system can be

used to retrieve records from a database using a database style query. Our system

can retrieve records which satisfy a single query. Moreover, the system is case-sensitive

and supports queries containing non-alphanumeric

characters such as the space character and brackets.

For this thesis we focus on retrieving records from a database which begin

with a query string, that is, equivalent to the database operation LIKE 'query%'.

We run tests with various configurations and analyze the possible threat

component to prove the practicality and confidentiality of our system.
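
The retrieval semantics described above can be stated on plain text before any obfuscation is applied. The following is only a reference sketch of that behaviour: a case-sensitive match on records that begin with the query string, with spaces and brackets treated as ordinary characters. The function name and the sample records are hypothetical, and the sketch says nothing about how the obfuscated system performs the match.

```python
def prefix_search(records, query):
    """Return the records that begin with `query` (case-sensitive)."""
    # str.startswith compares characters exactly, so 'Meet' and 'meet'
    # are different prefixes and punctuation is matched literally.
    return [record for record in records if record.startswith(query)]


# Hypothetical plaintext records used only for illustration.
records = ["Meeting (Mon)", "meeting notes", "Meet Bob", "Team Meeting"]

matches = prefix_search(records, "Meet")
bracket_matches = prefix_search(records, "Meeting (")
```

Here `matches` contains only the two records that start with an uppercase "Meet", and `bracket_matches` shows that a query containing a space and a bracket is handled like any other prefix.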

To the best of our knowledge, this is the first work that uses visual

cryptography as a data obfuscation technique to achieve data confidentiality on

the cloud and runs database queries for effective data retrieval.

1.3 Thesis organization

Chapter 2 presents related work in the domain of data confidentiality on the

cloud. We discuss why present systems which employ standard cryptography

procedures have not yet stood up to the task of performing efficient retrieval from

the cloud. We also discuss present visual cryptography schemes and their

relationship with data obfuscation.

Chapter 3 presents our main work along with detailed algorithms, working

examples and discussion on complexity of our approach. The background

information and metrics used are discussed as well.

Results are presented in Chapter 4. Discussion and analysis are presented

to adjudge data confidentiality in our system. We also discuss the running time

for our approach and how it is optimized.

Chapter 5 discusses how the system reacts to threats and attacks from

outside and within the cloud. Results are presented and we discuss possible

measures to further strengthen the security framework for our system.

We conclude with a final discussion and future work in Chapter 6.

Chapter 2

Related Work

In this chapter we discuss the present work in the domain of data confidentiality

in the cloud, namely using encryption/decryption. Later we contrast our work in

visual cryptography with the existing work.

2.1 Encrypted cloud

A customer naturally wants to protect his data on the cloud from unauthorized

access. If the resources on which data is stored are owned by the customer itself,

existing authentication and/or authorization measures can protect the data from

being disclosed, lost, corrupted or stolen. However, when the data is in the hands

of a third-party, that is, a public cloud provider, data is exposed. Data can be

sabotaged by an external attacker, who may hack some part of the cloud, or

even by an employee of the resource vendor [49], [50]. This means that a

customer wants a two-fold security envelope: to protect data from attacks outside

the cloud provider and avoid the data being visible/available to the cloud itself.

One of the methods to accomplish this is to encrypt the data on the cloud. This

section discusses this in further detail.

2.1.1 Data encryption

There is a vast literature on how to perform encryption on the cloud. Various

cutting-edge algorithms have been proposed and shown, at least theoretically,

to secure the data on the cloud. However, only a few of them have been

actually tested on realistic systems to judge their practical deployment.

A high level architecture of a cryptographic cloud is presented in [32].

Essentially, a data processor at the customer encrypts the data and metadata

(size, keywords etc.) and sends it to the cloud. The key is stored with the

customer only. A data verifier at the customer can verify the integrity of the data

at any time. A token generator at the customer side creates a token for data

retrieval and sends it to the cloud. The cloud retrieves the encrypted data using the

token and sends it to the customer. The customer then uses the decryption key it holds

to decrypt and retrieve the data. If a customer wants to grant access to another

user, a token can be sent to the new user for communicating with the cloud.

Such a scheme is suitable for simple storage operations. The Amazon

Simple Storage Service, S3, works in a similar manner. S3 authenticates its user

using encryption, but data is not encrypted by default. However, a user can

encrypt the data and store it on S3 [3].

The advantages of an encrypted cloud are well documented and discussed,

for example in [8], [11], [42]. Namely, data is controlled and maintained by the user,

and cryptography provides a strong secure framework for data confidentiality.

The risks of an untrustworthy cloud accessing the data and of an attack

from outside the cloud are both mitigated. Data is also protected from legal

jurisdiction arising from the location of the data, such as the US government using

the Patriot Act to collect the data [19]. Encrypted data is immune to such

jurisdiction since, without the key to decrypt it, the data is meaningless. Also, the

integrity of the data can be readily verified at any time by the user to detect a security

breach or data corruption.

2.1.2 Searching on encrypted data

Although encryption seems a favourable solution to data confidentiality, an

important aspect has not yet been discussed. Can we perform efficient operations

on encrypted data? A cloud acting as a simple storage platform is not a

feasible economic model. It must enable us to perform operations on the data. At

the very least, a cloud must act as a virtual database which can process queries

and from which records can be retrieved. Until now, no major cloud provider has come up with a

facility to search on encrypted data. For instance, on Amazon's SimpleDB,

encrypted data cannot be used as a part of query filtering conditions [2]. The

only way to run queries on encrypted data is to retrieve the entire dataset,

decrypt it and then run queries. Clearly, this is not a practical solution for large

databases. Even with technologies like Transparent Data Encryption used by

Microsoft and Oracle [54], which encrypt the physical files rather than the data

itself, one still cannot run queries on encrypted data and must rely on decryption

of the data before querying.

In the past decade, a lot of literature has emerged in the domain of

searchable encryption. Such a scheme generates a search index, over the full-text

or a keyword, and encrypts this index. An authorized user is given a token, using

which files that contain a keyword can be retrieved. The token can only be

generated using a key which is only available to an authorized user. Without the

token, the index is not revealed. The output of such a retrieval process does not

reveal the contents of the files. It only indicates that the files have a keyword in

common.
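
The index-and-token idea described above can be sketched with a keyed hash. This is a toy illustration only, not the construction of any cited scheme: the index maps a keyed hash of each keyword to the identifiers of the files that contain it, and a search token is the same keyed hash computed by whoever holds the key. The function names, key and file contents are hypothetical, and encryption of the files themselves is omitted.

```python
import hashlib
import hmac


def make_token(key: bytes, keyword: str) -> str:
    """Derive a search token for `keyword`; only a key holder can do this."""
    return hmac.new(key, keyword.encode("utf-8"), hashlib.sha256).hexdigest()


def build_index(key: bytes, files: dict) -> dict:
    """Map each keyword token to the set of file ids containing that keyword."""
    index = {}
    for file_id, text in files.items():
        for word in set(text.split()):
            index.setdefault(make_token(key, word), set()).add(file_id)
    return index


def server_search(index: dict, token: str) -> list:
    """Server-side lookup: sees only opaque tokens and file ids."""
    return sorted(index.get(token, set()))


# Hypothetical data; in a real scheme the file bodies would be stored encrypted.
key = b"owner-secret-key"
files = {
    "f1": "budget meeting notes",
    "f2": "holiday photos",
    "f3": "meeting agenda",
}
index = build_index(key, files)
hits = server_search(index, make_token(key, "meeting"))
```

The lookup reveals that `f1` and `f3` share a keyword without revealing the keyword or the file contents, and a token derived with a different key matches nothing. Even this toy version shows why such schemes are limited to exact keyword matches unless further machinery is added.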

The first searchable encryption scheme was proposed in [55]. In the early years,

only single-keyword search was supported [1], [14], [20]. Multi-keyword search

involving conjunctive and disjunctive queries has been proposed in [6], [10], [27],

[51]. Also, [28], later improved in [35], introduced the concept of searching on

a part of the encrypted data using attribute-based encryption, which also

concealed the token. These works indicate the possibility of performing search on

encrypted data, however, no real system even at a moderate scale has proved the

practicality of encrypted search. The complexity of such operations is high,

making them impractical for large scale operations such as on a cloud.

Recently, fully homomorphic encryption was developed in [24]. It allows

algebraic operations to be performed on encrypted data such that the decrypted result is equivalent to

performing the same operations on the plain data. It is described as the "holy grail" of

cryptography. It would enable searching on encrypted data in a cloud system as if

searching on unencrypted data without exposing the data or the search query to

the cloud provider. The theoretical framework is in place but it is still far from

being practical. It takes immense computational power. A single Google search

would be one trillion times slower with fully homomorphic encryption,

and in the future the overhead is expected to be reduced to 100,000 times the computation

required for unencrypted computing [22]. In [25] and more recently in [26], it is

evident that fully homomorphic encryption is not yet efficient enough

for any practical application.
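
The homomorphic property itself can be illustrated with a much weaker scheme. The sketch below uses textbook RSA, which is homomorphic for multiplication only and is therefore not fully homomorphic; it is not the scheme of [24]. The parameters are tiny, insecure values chosen purely for illustration, and the function names are hypothetical.

```python
# Textbook RSA with tiny, insecure parameters (illustration only).
p, q = 61, 53
n = p * q          # modulus, 3233
e, d = 17, 2753    # public and private exponents; e*d = 1 mod (p-1)*(q-1)


def encrypt(m: int) -> int:
    return pow(m, e, n)


def decrypt(c: int) -> int:
    return pow(c, d, n)


m1, m2 = 7, 3

# The untrusted party multiplies ciphertexts without ever seeing m1 or m2.
c_product = (encrypt(m1) * encrypt(m2)) % n

# The key holder decrypts and obtains the product of the plaintexts.
result = decrypt(c_product)   # 21, as long as m1 * m2 < n
```

A fully homomorphic scheme supports both addition and multiplication on ciphertexts, and hence arbitrary computations, which is where the large overheads quoted above arise.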

The idea of using a hybrid cloud has also emerged in academia. The

work in [12] proposes to use two clouds. A trusted cloud encrypts the

data in the setup phase and performs security-critical operations. This data is sent

to an untrusted cloud, which handles the high query load and communicates with

the trusted cloud using a secure channel. Still, the performance aspect is not

discussed, and we believe the scheme is not efficient, as encryption is still involved.

Secondly, the reliance on assigning one cloud as the trusted party does not

eliminate the possibility of a data breach within that cloud.

Using trusted hardware instead of a software service provider is introduced

in [5]. A server-side trusted hardware is used to achieve full privacy control on the

data. However, trusted hardware has a performance bottleneck when working with

large data due to heat dissipation and is constrained in computing and memory

capacity. Besides, although their system achieves better performance than a

completely encryption-based system, it is far less efficient than querying an

unencrypted database. Trusted hardware is also not immune to attacks, such as

those during bootstrapping [45].

Most recently, secure ranked keyword search over encrypted cloud data has

been proposed [58]. In this model, a secure searchable index is built from distinct

keywords that are a part of the data. The index and the encrypted data are sent

to the cloud and, when a query in the form of a keyword is received, the cloud

searches the index and returns ranked results. Searching over untrustworthy

servers and retrieving certain top results using a confidential index were discussed

in [62], while [56] presents ranked search over an order-preserving cryptographic

function. However, [62] is inefficient and [56] does not support dynamic changes in

score. The work in [58] uses a new order-preserving symmetric encryption scheme

[9].

These works closely address the problem we present in this work; however,

there are various scenarios where a keyword search linking to a document is not

valid. For instance, when setting up a confidential meeting or a personal task list,

the keyword-document pairing mechanism is not appropriate as there are no

indexed documents for each entry. Also, the calendar for such tasks will be fed

into a relational database model. This makes our approach more suitable as we

address the problem as a natural database query operation, LIKE. Using our

approach a wider spectrum of problems can be addressed than the ranked

keyword search over encrypted data schemes.

2.1.3 Infeasibility of encryption alone for the cloud

Following from the previous section we can observe that encryption is not a viable

option for data confidentiality on the cloud. Apart from the overhead involved in

the process itself, a large overhead is involved in key management and

safekeeping. Apart from searchable encryption, Private Information Retrieval (PIR) [16],

which allows the data owner himself to query a database, faces similar challenges.

The failure of cryptography as a stand-alone solution for achieving privacy

on the cloud is proved in [21]. The authors indicate that even fully homomorphic

encryption is not sufficient for the cloud environment. The cost incurred by

employing cryptography on the cloud renders it infeasible, even when only core

technology costs are considered and basic schemes such as AES, MD5, SHA-1, RSA and

DSA are used [15]. In addition, [15] states that to break even, we need a large amount of

data in the cloud (e.g., 10^9 tuples) and queries which return an infinitesimal

fraction of the data (e.g., 0.00037%).

Hence, we definitely need a new paradigm for searching on the cloud which

is efficient and provides data confidentiality. In our work we propose to use data

obfuscation using visual cryptography. To the best of our knowledge, no work has

been presented which combines visual cryptography with application to the cloud.

We do not use any standard encryption procedure that involves keys and key

management. Our decryption procedure does not use any keys either and relies on

the human visual system to verify the data retrieved.

2.2 Data obfuscation

Protecting a database using data obfuscation is a promising approach as it avoids

the large computational requirements associated with searching on an encrypted

database, and renders a feasible and practical system.

The impossibility of achieving full confidentiality using obfuscation was

mentioned in [7], however, their work revolves around the standard encryption

paradigm and works for certain classes of operations. Our work achieves a "virtual

black box" model as we consider a different set of operations without using

encryption. Obfuscating the database such that only certain queries run on it is

presented in [41]. Their work relies on server-side solutions and thus on a

trusted provider. In [39], obfuscation from a cloud perspective is introduced,

however, only weak privacy concerns are addressed, and some information about

the data is made available to the service provider. Also, [39], [41] involve key

management which introduces overhead.

There have been attempts at obfuscating the search query itself using some

noise [61]. However, as highlighted in [47], an attacker can easily determine the

query using rudimentary attacks. Thus, a noisy query will not help in achieving

privacy on the cloud. Moreover, such a scheme assumes that the provider is a

trusted entity, which is not true all the time and exposes the system to various

threats as discussed in the preceding sections.

Thus, we believe the data, along with the query itself, needs to be obfuscated

to protect the data from an untrusted provider. We propose to use visual

cryptography to achieve data confidentiality.

2.3 Visual cryptography

Visual cryptography was introduced in [40]. It relies on decryption using only the

human visual system where the data is in a visual form, such as, printed text or

pictures. Thus, it avoids the huge computational complexity associated with

standard encryption schemes discussed in preceding sections. Visual cryptography

relies on breaking up an image into multiple shares such that the image can be

reconstructed only when all the shares are available. A share is printed separately

and when all the shares are superimposed, the original image can be revealed.
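
The share-and-superimpose idea can be sketched for the simplest case of two shares and a black and white image. The code below follows a common textbook two-share construction in which every secret pixel is expanded into two subpixels; it is an illustrative assumption rather than a reproduction of any cited scheme, and the function names and sample row are hypothetical. Superimposing printed transparencies is modelled as a bitwise OR.

```python
import random

# Subpixel patterns: 1 = black subpixel, 0 = transparent subpixel.
PATTERNS = [(0, 1), (1, 0)]


def make_shares(secret_row, rng):
    """Split a row of secret pixels (1 = black, 0 = white) into two shares."""
    share1, share2 = [], []
    for pixel in secret_row:
        pattern = rng.choice(PATTERNS)
        complement = tuple(1 - s for s in pattern)
        share1.extend(pattern)
        # White pixel: both shares carry the same pattern.
        # Black pixel: the second share carries the complementary pattern.
        share2.extend(complement if pixel else pattern)
    return share1, share2


def stack(share1, share2):
    """Superimpose two transparencies: a subpixel is black if either is black."""
    return [a | b for a, b in zip(share1, share2)]


secret = [1, 0, 0, 1, 1, 0]
s1, s2 = make_shares(secret, random.Random(0))
stacked = stack(s1, s2)

# Count black subpixels per original pixel in the stacked result.
blackness = [stacked[2 * i] + stacked[2 * i + 1] for i in range(len(secret))]
```

In the stacked result a black secret pixel appears fully black (two black subpixels) and a white one appears half black, so the eye can tell them apart. Each share on its own has exactly one black subpixel per pixel regardless of the secret, which is why a single share reveals nothing.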

While the preliminary work in this domain involved only black and white,

single images, later schemes involved two images and two shares [60], and

[53] generalized this to multiple images in two shares. This approach was extended

to colored images in [57] and more recently in [52], [59].

We consider using multiple cloud providers for data confidentiality. Data,

which is plaintext, is converted to images. Random noise is added to the images

to corrupt the data. Each image is divided into equal sized multiple shares and

sent to separate cloud providers. Cloud providers are not aware of each other's

presence and hence data is disjoint in the cloud. For data retrieval, a novel

method to create a mask is introduced which is used to reconstruct the original

data. The human visual system verifies the results.

We believe this is the first time visual cryptography has been used to achieve

data confidentiality on a cloud system and retrieve database style records based

on a search query.

Chapter 3

Data confidentiality using visual cryptography

In this chapter, we describe the overall system design and algorithms developed in

detail. The necessary background, including the terminology and the metrics
employed, is also explained.

3.1 Introduction

Our work mainly deals with using visual cryptography to secure data in the cloud

instead of using encryption schemes. As discussed in Chapter 2, all the

existing work in the domain of cloud computing security is focused on using some

encryption scheme for sending and retrieving data. This makes performing

operations such as a numerical calculation or a database query almost impossible

or highly inefficient. Our approach is to avoid using encryption. Instead, we

employ visual cryptography for sending and retrieving data. Moreover, we show

how efficient database operations can be without losing any data in the process.

Experimental results pertaining to our work are presented and analyzed in the

next chapter.

[Figure 3.1, flowchart labels. Sending: crop the image into as many parts as there are cloud providers. Retrieval: data to be retrieved based on a query; request data from each cloud; retrieve the record pertaining to the query from the clouds; if the criteria are satisfied, the cloud sends the image to the user; the user combines the image and evaluates it using certain metrics.]

3.2 Overall concept

The overall concept of the system is simple: instead of sending the data in its raw

form to one cloud, convert the data into basic images and send part of the image

to different independent clouds. Figure 3.1 illustrates the process.

[Figure 3.1, sending path: data to be sent to the cloud; convert to image; send to clouds #1 to #4.]

We consider data that consists only of ASCII characters. Data containing images
is not yet compatible with our system. The data should be in the form of a simple
text file; if it is not, it must be converted to that format. Each ASCII character
in the text file is then represented by an image. Each image is cropped into as
many equal parts as the number of cloud service providers, and each part is sent
to a different cloud. Note that all cloud providers are independent and are not
aware of each other's presence. Each cloud returns to the user the address at
which the data is stored. As in a regular database, the data itself is stored as
database records.

For data retrieval, the system is designed to respond to database-style
queries. For instance, using the LIKE keyword means the user wants to retrieve
records that contain a certain string of characters. The user searches the records
that satisfy the query and calculates certain metrics. Based on the result of these
metrics, the system retrieves the record from each cloud. Image merging and data
recovery occur at the user end.

3.3 Sending data

The user has some data in a simple text file that is to be uploaded to the clouds.
If the data is in another format, it must be converted to text. The data can
consist of any printable ASCII character. (Of the 128 ASCII characters, 95 are
printable: 94 visible glyphs plus the space.) We create a library of images
corresponding to each ASCII character at the user end. The library needs to be
constructed only once. The next step is to convert the text into images.

3.3.1 Converting text to image

We propose using the BMP file format for converting text to images. The bits

representing the pixels are packed in rows, which allows easy image manipulation

for our system. To begin with, we create an ASCII library of images in BMP
format (henceforth referred to as lib). Each image is constructed in grayscale with
1-bit depth and two colors, and is thus black and white. We fix the resolution at
40×40, with which each image is 382 bytes in size.
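The 382-byte figure can be checked against the BMP layout. The sketch below assumes the common uncompressed layout (a 14-byte file header, a 40-byte BITMAPINFOHEADER, a two-entry color table of 4 bytes per entry, and rows padded to a multiple of 4 bytes); the function name is illustrative and is not part of our implementation.

```python
def bmp_size_1bit(width: int, height: int) -> int:
    """Size in bytes of an uncompressed 1-bit BMP (assumed BITMAPINFOHEADER layout)."""
    file_header = 14                       # BITMAPFILEHEADER
    info_header = 40                       # BITMAPINFOHEADER
    palette = 2 * 4                        # two colors, 4 bytes each
    row_bytes = ((width + 31) // 32) * 4   # 1 bit per pixel, rows padded to 4 bytes
    return file_header + info_header + palette + row_bytes * height


print(bmp_size_1bit(40, 40))  # 382
```

Under these assumptions, each 40-pixel row occupies 8 bytes after padding, so the pixel data takes 320 bytes and the headers and palette take the remaining 62.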

The image resolution can be increased if the number of cloud providers among
which the image is to be cropped and sent is high. However, we would not want one

cloud storing too many parts of an image for security reasons. Thus, 40×40 seems

an optimum size. Also, note that an image format such as JPEG does not
support 1-bit-per-pixel images, and thus leads to a larger file size. To allow

scalability on the cloud, we try to keep the image file size as low as possible.

Figure 3.2 shows some of the ASCII characters and their respective BMP

images. The images are displayed in their actual 40×40 resolution without any

resizing. These were generated using font Arial, style 'bold'. The lib can be

generated using any font, but preferably it should be close to the font used in the

text file. The text file uses the Lucida Console font, regular style, size 10.

[Figure: sample 40×40 BMP images for the characters a, A, s, S, 3, 8, #, !, + and (.]

Figure 3.2 Image library of ASCII characters.

3.3.2 Image obfuscation using noise

As data is read, its equivalent images are obfuscated by adding noise to each
image. Adding noise to the data greatly reduces the possibility of discovering
the actual text. We focus on Gaussian and speckle noise, since they are two of
the most common categories of image noise, and compare their performance for
our system.

Gaussian noise

For Gaussian noise, the noise density follows a normal distribution, also known as
the Gaussian distribution. Mathematically, it is described by (1).

\[ p(x) = \frac{1}{\sigma\sqrt{2\pi}} \, e^{-\frac{(x-\mu)^2}{2\sigma^2}} \tag{1} \]

The mean \(\mu\) and variance \(\sigma^2\) identify the normal distribution. The noise is often
represented as \(N(\mu, \sigma^2)\). In the following sections of the thesis, we will represent
Gaussian noise simply as \((\mu, \sigma^2)\). The mean controls where the center of the
distribution lies, and the variance measures the width, that is, how concentrated
the distribution is around its mean. Generally speaking, as the variance increases,
the image becomes noisier.

The standard normal distribution has zero mean. Varying the mean shifts the
distribution, so the noise is no longer symmetric about zero. When the mean is
positive, the distribution shifts to the right, and when the mean is negative, it
shifts to the left. Figure 3.3 illustrates this shift for Gaussian noise. Having a
positive mean results in more high values, that is, white pixels (=1) as noise. The
opposite occurs with a negative mean.

Figure 3.3 Normal distribution at different mean and variance.

While adding noise to the images, the original pixel is replaced by a black

or white pixel. So image size and other attributes remain the same after noise

addition. Note that in the original image, the text is black on a white background.

Thus, if the noise is composed of too many high values, that is, it has a positive
mean, the black text will carry a large amount of white noise, whereas
the original white background will have less black noise added to it.

Black pixels which are added to the text as noise, simply replace the

original black pixel on text with a noisy black pixel, thus text is not affected by

noisy black pixels. The opposite holds for the white background. The black noise

added to the background will be visible, but white pixels added as noise will

simply swap the original '1' in the background with a '1'.

[Figure 3.4, panel labels as (mean, variance): (0, 0.30), (0.20, 0.30), (0.10, 0.30)]

On the other hand, if the mean is negative, the background will be very
noisy, as black pixels are added as noise, whereas the black text will be much
less noisy, with very few white pixels on it. Figure 3.4 illustrates this. Evidently,
positive and negative means affect the text and the background disproportionately
for Gaussian noise. In the next chapter we describe the experiments carried out
to determine the optimum noise parameters for our system.
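The asymmetry described above can be reproduced with a small sketch. It is not the MATLAB routine used in our implementation: the re-binarization rule (threshold at 0.5) and the function names are assumptions made for illustration only.

```python
import random


def add_gaussian_noise(image, mean, variance, rng):
    """Add N(mean, variance) noise to a 0/1 image and re-binarize at 0.5 (assumed rule)."""
    sigma = variance ** 0.5
    return [[1 if pixel + rng.gauss(mean, sigma) >= 0.5 else 0 for pixel in row]
            for row in image]


def count_flips(original, noisy, value):
    """Count pixels that had `value` in the original and changed in the noisy image."""
    return sum(1 for orow, nrow in zip(original, noisy)
               for po, pn in zip(orow, nrow) if po == value and pn != value)


rng = random.Random(0)
# Toy 100x100 image: top half black "text" (0), bottom half white background (1).
image = [[0] * 100 for _ in range(50)] + [[1] * 100 for _ in range(50)]
noisy = add_gaussian_noise(image, mean=0.20, variance=0.30, rng=rng)
text_flips = count_flips(image, noisy, 0)        # black text pixels turned white
background_flips = count_flips(image, noisy, 1)  # white background pixels turned black
print(text_flips, background_flips)
```

With a positive mean, the count of flipped text pixels should come out well above the count of flipped background pixels; swapping the sign of the mean reverses the roles.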

Figure 3.4 Gaussian noise. [Further panel labels, as (mean, variance): (0, 0), (0, 0.50), (-0.10, 0.30), (-0.20, 0.30)]

Speckle noise

Speckle noise is multiplicative in nature, as indicated in (2). \(I_{orig}\) is the
original image, \(I_{noise}\) is the output after adding noise, and \(n\) is uniformly
distributed random noise with mean 0 and variance \(\sigma^2\).

\[ I_{noise} = I_{orig} + n \cdot I_{orig} \tag{2} \]

As speckle noise has \(\mu = 0\), we need to vary only \(\sigma^2\). Due to the nature of
the noise, only black pixels are added to the image as noise. Thus, when
speckle noise is added, some of the original white pixels in the background are
replaced by black pixels, while the original black pixels of the text remain black.
Figure 3.5 shows sample images when speckle noise is added at different
variances. Note that the mean is 0 in all cases.

[Figure: panels at variance 0, 0.15, 0.39, 0.50, 1.00, 1.50, 2.00 and 3.00]

Figure 3.5 Speckle noise.
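A minimal sketch of multiplicative noise on a binary image follows. It draws n from a uniform distribution with zero mean and the requested variance, and it assumes the result is re-binarized at 0.5; both the threshold and the function name are illustrative assumptions rather than a description of the MATLAB routine we used.

```python
import random


def add_speckle_noise(image, variance, rng):
    """Apply I + n*I with n uniform (mean 0, given variance); re-binarize at 0.5 (assumed)."""
    half_width = (3.0 * variance) ** 0.5   # uniform on [-a, a] has variance a^2 / 3
    return [[1 if pixel + rng.uniform(-half_width, half_width) * pixel >= 0.5 else 0
             for pixel in row] for row in image]


rng = random.Random(0)
# Toy 20x20 image: top half black "text" (0), bottom half white background (1).
image = [[0] * 20 for _ in range(10)] + [[1] * 20 for _ in range(10)]
noisy = add_speckle_noise(image, variance=0.50, rng=rng)
white_left = sum(sum(row) for row in noisy[10:])
print(noisy[:10] == image[:10], white_left)
```

Because a black pixel has value 0, the product n*I vanishes there and the text is never altered; only background pixels can change, which is the property the mask construction relies on later.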

Peak signal-to-noise ratio

Peak signal-to-noise ratio or PSNR is a measure of the ratio of maximum possible
signal power to the power of noise. It is expressed in decibels (dB) and defined as
below,

\[ MSE = \frac{1}{mn} \sum_{i=0}^{m-1} \sum_{j=0}^{n-1} \left[ I(i,j) - K(i,j) \right]^2 \tag{3} \]

\[ PSNR = 20 \cdot \log_{10} \left( \frac{MAX_I}{\sqrt{MSE}} \right) \tag{4} \]

The mean-square-error or MSE for two images I and K, both of dimensions m×n,
quantifies the difference between the two images. In (4), \(MAX_I\) is the maximum
possible pixel value in image I. PSNR is preferred since its mathematical
complexity is the least among quality metrics such as SSIM and VQM; it thus puts
the least burden on the cloud while remaining a strong quality metric.

A lower PSNR implies lower quality of a signal. Thus, to increase the obfuscation
level and make it harder for an unauthorised user to determine the actual text, we
need a low PSNR. To determine how PSNR affects the data in our system, we
carried out experiments which are shown in Chapter 4.
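As a sketch of how (3) and (4) apply to 1-bit images, the following computes MSE and PSNR for two small binary images. The function names are illustrative, and returning infinity for identical images is a convention assumed here.

```python
import math


def mse(image_i, image_k):
    """Mean-square-error between two images of the same m x n dimensions."""
    m, n = len(image_i), len(image_i[0])
    total = sum((a - b) ** 2
                for row_i, row_k in zip(image_i, image_k)
                for a, b in zip(row_i, row_k))
    return total / (m * n)


def psnr(image_i, image_k, max_i=1.0):
    """PSNR in dB; max_i is 1 for binary images. Identical images give infinity (assumed)."""
    error = mse(image_i, image_k)
    if error == 0:
        return math.inf
    return 20.0 * math.log10(max_i / math.sqrt(error))


clean = [[0, 1], [1, 1]]
noisy = [[0, 1], [1, 0]]          # one of four pixels flipped
print(mse(clean, noisy))          # 0.25
print(round(psnr(clean, noisy), 4))
```

For binary images the MSE is simply the fraction of flipped pixels, so a noisier image gives a larger MSE and therefore a lower PSNR.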

3.3.3 Data division across the cloud

After adding noise to the bitmap images, each image is split into as many equal

parts as the number of cloud providers. The cloud providers are independent and

are not aware of each other's presence or data held by other providers. The data

held by them is disjoint, that is, no two providers have any part of the data

common between them. Segregating the data among different clouds maintains
the disjoint property and better preserves data confidentiality. This way, even if
one provider is hacked and an unauthorized party accesses the data, the data
owner is still safe, since only a small part of the data is revealed. With respect
to the current visual cryptography schemes discussed in Section 2.3, each split
part of the original image is essentially a secret image or a share.

For our system, we consider four cloud service providers. If the number of

cloud providers increases, the data is further divided, thus ensuring that each cloud
has a minimal amount of data. The effect of changing the number of providers on
the security of the data is examined in Chapter 5. Figure 3.6 illustrates how
data will be handled by each cloud for the word 'Once' with four clouds and a
record size of four. All the characters, including spaces, are split similarly.

[Figure: the record 'Once', positions 1 to 4, with the part of each character image held by cloud 1, cloud 2, cloud 3 and cloud 4.]

Figure 3.6 Data carried by each cloud.

3.3.4 Sending algorithm

In the previous sections we presented in detail the procedure of preparing the

data. Here we present the pseudocode for the algorithm, 'Algorithm 1: SEND

(input_data)', where input_data is a text file to be sent to the cloud.

Algorithm 1: SEND (input_data)

ns ← number of cloud service providers/servers
rs ← record size, that is, number of images per record
lib ← library of ASCII images in .bmp, each of size p×p where (p mod ns) = 0
tmp ← temporary working directory
nor ← number of records
char[ ] ← 0, array to read data character by character until end_of_line is reached
length ← 0, length of char[ ]
c ← 0, number of images in the current record
r ← 1, current record number
while input_data is not exhausted do
    char[ ] ← read from input_data each character until end_of_line
    length ← 0
    for i ← 1 until char[i] = '\0' do
        copy to tmp char[i].bmp from lib
        length ← length + 1
    end for
    add noise to all images in tmp
    for j ← 1 to length do
        crop char[j].bmp in tmp into ns parts, output is ns images of size (p/ns)×p
        for k ← 1 to ns do
            transmit part k of char[j].bmp to cloud k
        end for
        c ← c + 1
        if c == rs then
            c ← 0, r ← r + 1
        end if
    end for
end while
if c == 0 then nor ← r − 1 else nor ← r

In our system, we have four cloud providers, thus ns = 4. In Chapter 4 we
present the results and discuss the effect of increasing the number of cloud
providers to 8. Record size, rs, is 8 or 16, and the results for both are presented
in the next chapter. As we saw in Section 3.3.1, with an image size of 382 bytes,
the record size is 3056 or 6112 bytes for rs of 8 and 16 respectively. We read the
data character by character until we reach the end of line, '\0'. For the length of
the line, we copy the respective characters from our library of ASCII images lib
into a temporary directory tmp.
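The record arithmetic above can be sketched as follows. The helper names are illustrative, and counting a final partial record as one record is an assumption made here for the sketch.

```python
IMAGE_BYTES = 382  # size of one 40x40, 1-bit BMP image (Section 3.3.1)


def record_bytes(rs: int) -> int:
    """Bytes needed to hold one record of rs character images."""
    return IMAGE_BYTES * rs


def number_of_records(n_chars: int, rs: int) -> int:
    """Records needed for n_chars characters; a partial last record counts as one (assumed)."""
    return -(-n_chars // rs)  # ceiling division


def locate(char_index: int, rs: int) -> tuple:
    """Map a 0-based character index to its 1-based (record, position) pair."""
    return char_index // rs + 1, char_index % rs + 1


print(record_bytes(8), record_bytes(16))               # 3056 6112
print(number_of_records(len("computer science"), 8))   # 2
print(locate(9, 8))                                    # (2, 2)
```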

Next, the noise, Gaussian or speckle in our case, is added to all these

characters. Then, each character is split into as many equal parts as the number

of servers, ns. This requires that (p mod ns) = 0 for a p×p image. In our case we

have p as 40. The size of the image can be changed to fit the space requirements.

We believe that working with smaller images will reduce the space and

computation requirements. Thus, a p×p image is converted to ns images of size

(p/ns)×p. That is, for a 40×40 image and with 4 clouds, we get 4 images of size

10×40 if cropped row-wise.
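Row-wise cropping and the matching reassembly at the user end can be sketched as below. Images are represented as lists of pixel rows, which is an assumption of this sketch; our implementation operates on BMP files through ImageMagick.

```python
def crop_rows(image, ns):
    """Split a p x p image (list of rows) into ns strips of p/ns rows each."""
    p = len(image)
    if p % ns != 0:
        raise ValueError("p must be divisible by ns")
    step = p // ns
    return [image[i * step:(i + 1) * step] for i in range(ns)]


def reassemble(parts):
    """Stack the strips back together in cloud order to recover the image."""
    return [row for part in parts for row in part]


p, ns = 40, 4
image = [[(r + c) % 2 for c in range(p)] for r in range(p)]   # toy checkerboard
parts = crop_rows(image, ns)
print(len(parts), len(parts[0]), len(parts[0][0]))            # 4 10 40
print(reassemble(parts) == image)                             # True
```

Column-wise or random cropping would only change how the strips are cut and the order in which `reassemble` stacks them; the user must remember that order.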

Cropping along the column is also an alternative. For better security, other

combinations such as diagonal or even a random crop can be done. In the case of

random cropping, the user has to store the random sequence so that, on data
retrieval, the user knows how to combine the images received from the clouds. Note
that random cropping puts more computational burden on the user. Lastly, the

number of records, each of size rs, is stored as nor, and will be used for data

retrieval. Figure 3.7 illustrates the sending process.

[Figure: read the respective images from lib; add noise to the images; split each image and send the parts to cloud 1, cloud 2, cloud 3 and cloud 4, at positions 1 to 8 in a record.]

Figure 3.7 Sending data.

Note that even though clouds are independent, the system behaves as one

to the user. The user first logs into the account held with each cloud provider to
which data is to be sent. When the user gives the command to send, the system connects

to the specified clouds internally, establishes independent channels of

communication, and sends the data.

[Figure 3.7, first steps: the text to be sent to the cloud, 'Hi cloud', is read character by character.]

3.4 Retrieving data

The overall motive for our work is to construct a system which can allow a user

to perform database operations on the cloud. A user must be able to send

database style queries to the cloud and the cloud must return records which

satisfy the query. In our work, we concentrated on developing a system which can

handle the LIKE database query. To be precise, the user should be able to retrieve

records beginning with a certain string, that is, in terms of database terminology:

'LIKE query%'. The search query is composed in a simple text file and can consist

of any ASCII character.

Since our objective is to retrieve records which begin with a certain query,

for a query of length l, we need to evaluate the first l locations in a record. Then,

we need to check if each of the first l images in a record match with the respective

equivalent images of the search query. If the query string matches with the images

in the first l locations of the record, then we retrieve the record and send it to the

user.

The key to the procedure above is pattern matching. We need to

determine a suitable method to match the noisy data in the server with the

unnoisy images of the search query. Then a suitable metric needs to be assigned

to determine whether a match is found or not. We must also consider false

positives and unsuccessful searches. In the subsequent sections we will discuss

these issues further.

3.4.1 Image retrieval from the records

Once the query is received, the system retrieves the images stored at selected

locations at the beginning of a record from each provider. Recall that the number
of records at each cloud is the same, and that each cloud holds only a part of every original image.

The first image from the first record in each server is retrieved. Aligning

these images together, we will get the original noisy image of a character which

was cropped and sent to individual clouds. Thus, for instance, an original 40×40

noisy image of a character was split into four 10×40 images and sent to four cloud

providers. The same is repeated for all characters and they are stored in records of

size eight at each cloud.

Using the terminology from Algorithm 1, let \(r_{ij}[k]\) denote the element at
server i (1 ≤ i ≤ ns), in record number j (1 ≤ j ≤ nor), at location k inside
record j (1 ≤ k ≤ rs). Note that for our system ns = 4, rs = 8, and nor
depends on the length of the input text and rs. Then, in the first step of data
retrieval, the images at \(r_{11}[1]\), \(r_{21}[1]\), \(r_{31}[1]\), \(r_{41}[1]\) are retrieved and assembled to
produce the original noisy image of a single character. This character is matched
with the first character in the query string, which is not noisy. If a match is found,
then \(r_{11}[2]\), \(r_{21}[2]\), \(r_{31}[2]\), \(r_{41}[2]\) are retrieved and assembled, and the constructed
image is matched with the second character in the query. This is repeated for all
the characters in the query string.

If a match is not found, then \(r_{12}[1]\), \(r_{22}[1]\), \(r_{32}[1]\), \(r_{42}[1]\) are retrieved and the
procedure is repeated. The process of performing the match is explained in the next

section.
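The indexing scheme can be illustrated with nested lists, where `clouds[i-1][j-1][k-1]` plays the role of r_ij[k]. This in-memory layout and the function name are assumptions made for the sketch; in practice each cloud returns its strip over its own channel.

```python
def assemble(clouds, j, k):
    """Rebuild the image at record j, location k (both 1-based) from strips r_ij[k]."""
    image = []
    for cloud in clouds:                    # i = 1 .. ns, in the order used for cropping
        image.extend(cloud[j - 1][k - 1])   # rows of strip r_ij[k]
    return image


# Toy setup: ns = 2 clouds, nor = 1 record, rs = 2 locations, 2x2 images.
clouds = [
    [[[[0, 1]], [[1, 1]]]],   # cloud 1 holds the top row of each image
    [[[[1, 0]], [[0, 0]]]],   # cloud 2 holds the bottom row of each image
]
print(assemble(clouds, 1, 1))  # [[0, 1], [1, 0]]
print(assemble(clouds, 1, 2))  # [[1, 1], [0, 0]]
```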

[Figure 3.8 labels: unnoisy image stored in lib; noisy image from the cloud; bitwise AND; mask]

3.4.2 Matching the images

The images retrieved from the clouds are noisy. Pattern matching on these images

generates extremely chaotic results. Images must be denoised first.

Creating the mask

Note that we are working on bitmap images and each pixel is simply black or

white. The text is black and background is white. Thus, performing a bitwise

AND operation between noisy and unnoisy images of the same character will

produce an unnoisy image, which we call the mask. Then, we can compare the

mask with the actual image of the character in lib, which will result in an almost

perfect match. However, there is a problem, as illustrated in Figure 3.8 for
Gaussian noise.

Figure 3.8 Problem with creating the mask.

Clearly, the mask is noisy, whereas we expected an unnoisy image. This is
because bitwise AND produces a 1 (=white) only when both inputs are 1; in
all other cases the output is 0 (=black). In the background, the unnoisy image is
all 1, while the noisy image has both 0 and 1. Thus, the output remains noisy in
the background. For the text, the original image is all 0 and the noisy image has
both 1 and 0, so we get 0, that is, black, as the text color in the mask. We cannot
perform pattern matching with so much noise in the mask.

Instead, if we perform a bitwise NOT on both the original and noisy images,
AND the results, and then NOT the output of that step, we get the correct mask.
Note that in this mask the background is all white; however, the text remains
noisy for Gaussian noise, while for speckle noise even the text is perfectly
unnoisy. For details on this, refer to Section 3.3.2. Thus, if the original image is
\(I_{orig}\) and the noisy image is \(I_{noisy}\), we generate the mask using equation (5),
and the process is illustrated in Figure 3.9.

\[ \mathrm{mask} = \mathrm{NOT}\left( \mathrm{NOT}\; I_{orig} \;\mathrm{AND}\; \mathrm{NOT}\; I_{noisy} \right) \tag{5} \]

[Figure: mask creation shown for two panels, Gaussian noise and Speckle noise]

Figure 3.9 Creating the correct mask.

Using De Morgan's law as expressed in (6),

\[ A \;\mathrm{OR}\; B = \mathrm{NOT}\left( \mathrm{NOT}\; A \;\mathrm{AND}\; \mathrm{NOT}\; B \right) \tag{6} \]

we get

\[ \mathrm{NOT}\left( \mathrm{NOT}\; I_{orig} \;\mathrm{AND}\; \mathrm{NOT}\; I_{noisy} \right) = I_{orig} \;\mathrm{OR}\; I_{noisy}. \tag{7} \]

Thus,

\[ \mathrm{mask} = I_{orig} \;\mathrm{OR}\; I_{noisy}. \tag{8} \]

With respect to data retrieval, the mask is created between the search query

character and record rij[k]. The masks are created on the fly and pattern matching

is performed as explained below.
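The difference between the AND-only mask and the corrected mask can be seen on a tiny example. The sketch below treats images as lists of 0/1 rows; the function names are illustrative.

```python
def bit_not(image):
    return [[1 - p for p in row] for row in image]


def bit_and(a, b):
    return [[x & y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def bit_or(a, b):
    return [[x | y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def make_mask(original, noisy):
    """mask = NOT(NOT original AND NOT noisy), which equals original OR noisy."""
    return bit_not(bit_and(bit_not(original), bit_not(noisy)))


original = [[1, 0, 1], [1, 0, 1]]   # black text column on a white background
noisy = [[0, 0, 1], [1, 1, 0]]      # noise in the background and on one text pixel
print(bit_and(original, noisy))     # [[0, 0, 1], [1, 0, 0]]  background stays noisy
print(make_mask(original, noisy))   # [[1, 0, 1], [1, 1, 1]]  background is clean
```

In the corrected mask every background pixel is white again, and the only remaining difference from the original is the text pixel that the noise had turned white, as happens with Gaussian noise.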

Normalized cross-correlation

To perform the matching between two images, we employ the normalized
cross-correlation (NCC) metric as mentioned in [34] and indicated below in (9),

\[ NCC = \frac{1}{n-1} \sum_{x,y} \frac{\left( f(x,y) - \bar{f} \right) \left( t(x,y) - \bar{t} \right)}{\sigma_f \, \sigma_t} \tag{9} \]

where the similarity between images \(f(x,y)\) and \(t(x,y)\), each having \(n\) pixels, is
calculated using (9). The variables \(\bar{f}\) and \(\sigma_f\) are the average and standard
deviation of \(f\) respectively, and the same holds for \(t\). The output is 1 if the images
match perfectly and 0 if they do not match. We use NCC to determine if a mask
and the search query image match. As noted before, a mask is created every time
an image in the record is accessed during data retrieval. The NCC value, labelled
as ncc henceforth, between the mask and the character which is to be searched, is
then calculated. Detailed experiments were conducted using this metric, the
results of which are presented in Chapter 5. As an example, Figure 3.10 illustrates
the nature of the NCC metric.

[Figure: noisy images from the cloud and their masks under Gaussian noise and speckle noise, with ncc values 0.5730, 0.9020, 1.0000 and 0.7469.]

Figure 3.10 Pattern matching with NCC.

Note that ncc is between 0 and 1. It is important to define a threshold

value, thv, such that a match is considered positive only if ncc exceeds the

threshold. If a very low threshold is chosen, we will have numerous false positives

where a match is determined by the system even though ncc is very low. If the

threshold is high and the image is very noisy, it might result in more unsuccessful

searches, since ncc will be low in case of high noise. Note that avoiding high

threshold is important for Gaussian noise because the mask created in this case

has a noisy text. Thus, ncc will never be 1 for Gaussian noise, and in fact, it will

drop as the level of noise increases in the characters present in the cloud. On the

other hand, a high threshold does not affect speckle noise, as the mask is a perfect
unnoisy image which will produce an ncc of 1 in case of a match, irrespective of the

amount of noise.
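The interplay between ncc and thv can be sketched as follows. The code assumes the sample form of NCC (a 1/(n − 1) factor with sample standard deviations) and, as a further assumption, returns 0 when either image is constant; the function names are illustrative.

```python
import math


def ncc(f, t):
    """Normalized cross-correlation of two equally sized images (assumed sample form)."""
    fs = [p for row in f for p in row]
    ts = [p for row in t for p in row]
    n = len(fs)
    mean_f, mean_t = sum(fs) / n, sum(ts) / n
    std_f = math.sqrt(sum((p - mean_f) ** 2 for p in fs) / (n - 1))
    std_t = math.sqrt(sum((p - mean_t) ** 2 for p in ts) / (n - 1))
    if std_f == 0 or std_t == 0:
        return 0.0  # assumption: a constant image carries no pattern to match
    cov = sum((a - mean_f) * (b - mean_t) for a, b in zip(fs, ts))
    return cov / ((n - 1) * std_f * std_t)


def is_match(mask, original, thv):
    """A match is declared only when ncc reaches the threshold thv."""
    return ncc(mask, original) >= thv


original = [[1, 0, 1], [1, 0, 1], [1, 0, 1]]     # black vertical bar on white
clean_mask = [row[:] for row in original]        # speckle-style mask: text intact
noisy_mask = [[1, 0, 1], [1, 1, 1], [1, 0, 1]]   # Gaussian-style mask: one text pixel lost
print(ncc(clean_mask, original))
print(ncc(noisy_mask, original))
print(is_match(noisy_mask, original, 0.875), is_match(noisy_mask, original, 0.70))
```

On this toy glyph, losing a single text pixel already pulls ncc down to roughly 0.76, so a threshold of 0.875 would reject the character while a threshold of 0.70 would accept it. Real 40×40 glyphs have many more text pixels, so a single flipped pixel matters far less.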

3.4.3 Retrieval algorithm

Having presented in detail the procedure for data retrieval from the cloud in the

preceding sections, here we present the pseudocode of the retrieval algorithm,

'Algorithm 2: RETRIEVE(search_query)'. The content of the search_query is

contained in a text file and we want to retrieve records from the cloud which

begin with it. Only newly introduced variables are described; the rest are
explained in Algorithm 1. The variables ns and rs are the same in the
sending and retrieval procedures and are specified in Algorithm 1. Variable nor
depends on rs and the number of characters in input_data, and is sent by the
cloud to the user after SEND (input_data) is called. The search_query must not
exceed the record size, rs, to avoid overflow.

Algorithm 2: RETRIEVE (search_query, ns, rs, nor)

query[ ] ← 0, array to read query character by character until end_of_string
length ← 0, length of query[ ]
fin ← directory where retrieved records are placed
ncc ← normalized cross-correlation (NCC) value
thv ← NCC threshold value
loc ← location identifier for a record r_ij[k] (refer to Section 3.4.1 for details)
found ← a boolean indicating whether a record matching the query was found or not
query[ ] ← read from search_query each character until end_of_string
length ← length of query[ ]
for j ← 1 to nor do
    found ← 1
    for k ← 1 to length while found = 1 do
        loc ← 0
        for i ← 1 to ns do
            add record location r_ij[k] to loc
        end for
        noise[k] ← append records identified in loc and place them in tmp
        original[k] ← copy query[k].bmp from lib to tmp
        mask[k] ← original[k] OR noise[k]
        ncc ← perform NCC between mask[k] and original[k]
        if ncc < thv then
            found ← 0
        else found ← 1
        end if
    end for
    if found = 1 then
        for k ← 1 to rs do
            loc ← 0
            for i ← 1 to ns do
                add record location r_ij[k] to loc
            end for
            image[k] ← append records identified in loc and place them in fin
        end for
        complete record satisfying the search_query is retrieved and placed in fin
    else record not found
    end if
end for

The details of the algorithm have already been discussed in the preceding
sections. Figure 3.11 illustrates the retrieval algorithm. Let the input_data be
computer science, which we have already stored in the clouds with the number of
cloud providers, ns, being 4 and the record size, rs, being 8. Thus, as the number
of characters in input_data is 16, the number of records, nor, will be 2. We want
to retrieve records that begin with computer, which constitutes the search_query.
Assume we are working with Gaussian noise with \(\mu\) and \(\sigma^2\) as 0 and 0.30
respectively. Let the ncc threshold, thv, be 0.8750.
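The control flow of the retrieval procedure can be sketched end to end on toy data. This is a simplified illustration, not our C implementation: it keeps all strips in nested in-memory lists, stores the glyphs without noise, uses 2×2 glyphs with ns = 2 and rs = 2, and reuses the assumed sample form of NCC (returning 0 for a constant image). All names are illustrative.

```python
import math


def ncc(f, t):
    """Normalized cross-correlation (assumed sample form; 0 for constant images)."""
    fs = [p for row in f for p in row]
    ts = [p for row in t for p in row]
    n = len(fs)
    mean_f, mean_t = sum(fs) / n, sum(ts) / n
    std_f = math.sqrt(sum((p - mean_f) ** 2 for p in fs) / (n - 1))
    std_t = math.sqrt(sum((p - mean_t) ** 2 for p in ts) / (n - 1))
    if std_f == 0 or std_t == 0:
        return 0.0
    cov = sum((a - mean_f) * (b - mean_t) for a, b in zip(fs, ts))
    return cov / ((n - 1) * std_f * std_t)


def retrieve(clouds, query_images, thv):
    """Return the 1-based numbers of records whose leading images match the query."""
    hits = []
    nor = len(clouds[0])
    for j in range(nor):
        found = True
        for k, original in enumerate(query_images):
            # Assemble the stored image from the strips r_ij[k] held by each cloud.
            stored = [row for cloud in clouds for row in cloud[j][k]]
            # mask = original OR stored, applied pixel by pixel.
            mask = [[o | s for o, s in zip(orow, srow)]
                    for orow, srow in zip(original, stored)]
            if ncc(mask, original) < thv:
                found = False
                break
        if found:
            hits.append(j + 1)
    return hits


# Toy library and data: records "ab" and "ba", one image row per cloud.
lib = {"a": [[0, 1], [1, 1]], "b": [[1, 1], [1, 0]]}
records = ["ab", "ba"]
clouds = [[[[lib[ch][i]] for ch in rec] for rec in records] for i in range(2)]
print(retrieve(clouds, [lib["a"]], 0.875))             # [1]
print(retrieve(clouds, [lib["b"]], 0.875))             # [2]
print(retrieve(clouds, [lib["a"], lib["b"]], 0.875))   # [1]
```

A record is abandoned as soon as one character falls below thv, so the remaining locations of a non-matching record are never fetched from the clouds.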

[Figure: retrieval flow for search_query 'computer' on the cloud system (clouds 1 to 4, records 1 and 2, positions 1 to 8 in a record). Request access to each cloud; retrieve records r_11[1], r_21[1], r_31[1], r_41[1] and append them; retrieve the unnoisy c from lib; create the mask; compute the normalized cross-correlation; test ncc > thv. True: the character matches and the present record may contain the query, so retrieve records r_11[2], r_21[2], r_31[2], r_41[2] and repeat until end of string in search_query. False: the character does not match and the present record does not contain the query, so retrieve records r_12[1], r_22[1], r_32[1], r_42[1] and repeat until end of string in search_query.]

Figure 3.11 Retrieving data.

3.5 Complexity analysis

Analysing Algorithm 1 for sending data, we observe that its time complexity is
\(O(n)\), where \(n\) is the number of characters in the input text input_data.
Reading a character of the input data and copying the respective image is an
\(O(1)\) operation. Adding noise to an image is also \(O(1)\). Cropping each image into
as many parts as the number of clouds and subsequently storing them at the
clouds is \(O(1)\) as well. For \(n\) characters in the input text, the time complexity
is therefore \(O(n)\).

For Algorithm 2, time complexity is dominated by the NCC procedure.
Clearly, the process of reading and assembling the images from the records,
calculating the mask and finally retrieving the complete record is \(O(n/rs \cdot l)\),
where \(n/rs\) is the number of records and \(l\) is the number of characters in the

query and . However, NCC procedure takes O using the algorithm

described in [34], where image dimensions are and template to be matched

is . In our case , where is the pixel dimension of the image such

that mod 0 . Thus, the complexity of retrieval is O / · · .

(Recall \(p = 40\) in our experiments and each image is 382 bytes.) Comparing this

complexity to standard encryption based schemes as presented in Section 1.1,

namely the recent work in [5], [37], our system incurs a far lower complexity.

The storage requirement at the cloud is \(O(n)\) since in our scheme we are

storing images. This is more than the storage used when data is stored only as

unencrypted plaintext. It might seem that our scheme incurs slightly higher space
overhead than that incurred by encryption-based schemes; however, the latter
incur large key management overhead as well. Moreover, given how cheap storage
is on the cloud (see footnote 1), a slight increase in storage requirements is a
small price for data confidentiality and the ability to run queries on the cloud.
The latter is still not possible in encryption-based schemes.

With regard to the visual cryptography schemes discussed in Section 1.3.3,
the number of secret images or shares [40] is ns in our system. The space required
to store ns shares is equal to the space requirement for the original unnoisy
image. Adding noise does not increase the image size. Following noise addition, we
crop the noisy image into ns equal parts, which are then directly stored at the
designated clouds without incurring any unnecessary space overhead.

The variable pixel expansion [40], which is the number of sub-pixels in the

generated shares that represent a pixel from the original image, is one for our

system, the minimum possible. A larger pixel expansion, as used in other visual

cryptography schemes [48], results in large storage requirements. In addition, we

are working with binary two-color images, which results in a small image size. We

do not believe using colored images instead is a better choice as that will lead to

large computational and storage overheads.

Thus, the total space requirement to store one image is \(O(1/ns \cdot ns) = O(1)\).
Moving the entire data to the cloud is thus \(O(n)\). Hence, in terms of cost
incurred by the user, the operation of storing ns parts of an image at ns clouds is
equivalent to storing one image at one cloud. Comparing our result with other
visual cryptography schemes [48], our approach is better in terms of complexity.
We discuss more on the security aspect in the following section.

1 For Amazon's S3, the first 1 TB is $0.140 per GB and the next 49 TB is $0.125 per GB. All data transfer to the cloud is free, and retrieval is free for the first 1 GB/month and $0.120 per GB up to 10 TB/month. The price in all cases falls as usage increases. (All figures as of July 28, 2011.)

In our implementation of identifying the records and matching them, we
used simple hashing. This is easy to implement but suffers from overheads. A
better hash function could reduce the time complexity.

3.6 Security analysis

Our approach of achieving data confidentiality on the cloud using visual
cryptography is a promising way to secure data against an untrusted
cloud provider. At any instant, a cloud holds only a part of the data. The larger the
number of clouds the data is divided among, the better the security. The cloud is
neither aware of which part of the data it holds, nor does it know how many other
clouds hold the remaining data. Data stored on the clouds is disjoint. The data

itself is highly noisy. Thus, any attempt by an attacker to extract meaningful

information from the data will be rendered useless as only noisy (garbage) data

will be returned.

Let us further assume that the attacker has access to optimum values of

the parameters used for noise addition and data retrieval. Even then, considering

a nominal case that one of the service providers is hacked, an attempt to run a

query and retrieve the data will not yield anything meaningful. We cover the

threat analysis in detail in Chapter 5. Extensive experiments are run to prove our

case.

Comparing our visual cryptography scheme to the current work in this
domain [48], the number of secret images or shares generated is ns. In the
preceding Section 3.5, we discussed the complexity for the number of shares. With
respect to security, the more shares we have, the better the security is. In our
case, we have multiple shares, yet this does not result in an increase in complexity.
Multiple shares allow confidentiality, since even if ns − 1 shares are compromised
and available to an attacker, the data will not be completely revealed.

Our framework protects against both types of attack on a cloud: (1) from

outside the system by an external agent, and (2) within the system by an internal

agent. We already explained (1) above. For (2), consider an

internal employee eavesdropping on the data and trying to gain meaningful

information from it. The employee may be a self-motivated attacker trying to

steal the data, or supported by the resource provider itself where the provider

wants to know what the data is. With only 1/ part of the noisy data available,

full data retrieval will be impossible. Refer to Chapter 6 for results on threat

analysis that validate our hypothesis.

Chapter 4

Results

In this chapter, we describe the system design to evaluate the algorithms described

in Chapter 3. Further, experiments are run to test the effectiveness of the proposed

scheme and the results are discussed and analysed.

4.1 System design

The system comprises one computer acting as a client, which issues the commands

to store and retrieve the data, and four servers, which act as virtual cloud

providers. The four servers are independent and do not interact with each other.

The choice of four servers is based on the premise that each server should get only a

part of the data so that if one, or even two, server(s) are compromised, data is not

revealed. We later argue how varying the number of servers affects data

confidentiality.

For adding noise to the data, Gaussian and speckle noise were selected. As
discussed in Section 3.3.2, choosing the right noise parameters, that is, mean and
variance, is not straightforward. First, experiments are run to determine the
optimal noise parameter range; based on that range, we decide the NCC threshold
value, beyond which data retrieval from the clouds becomes infeasible.

4.2 Implementation

The system backend at the client and cloud providers is written in C. It handles the

basic task of sending and retrieving the data. Adding noise to the images was

accomplished using MATLAB [36] where a standalone executable was created for

noise addition. To perform the operations on the images namely cropping,

appending, creation of masks and calculation of NCC values, ImageMagick [30] was

used. The noise executable and the ImageMagick libraries are incorporated in C,

which is further combined with the backend to produce a single executable. The

entire system is accessible using a command line interface.

The data to be sent to and retrieved from the clouds is stored in a simple text
file. The library of images of ASCII characters is generated using the default
font Arial in bold style. As mentioned in Section 3.3.1, each image is 40x40 pixels
in grayscale, with two colors and a depth of one bit. Thus, each pixel is either
black or white. The size of each image is 382 bytes (see footnote 1).

4.3 Parameter estimation

We focus on evaluating two types of noise, Gaussian and speckle. The experiments
first determine how the system can be configured for effective data
confidentiality on the clouds, and then test the system on a large dataset to
validate effective data retrieval. The first part of the experiments evaluates the
parameters of the noise added to each character to judge how effectively data is
obfuscated. In the second part, based on the results from the first, a sample text
is sent to the clouds for storage. Data retrieval is initiated using a search query,
and the NCC value, ncc, is calculated between the data retrieved and the actual
data, as described in Chapter 3. The third part of the experiment involves
determining an optimal ncc threshold, thv, at which false positives and
unsuccessful searches cease to exist. Based on the results, the system is tested on
a larger dataset where multiple queries are sent to the cloud for retrieval. Both
noises are tested to evaluate their effectiveness.

1 When adding noise in MATLAB, each image is first converted to double precision, noise is added, and the noisy output image is then converted back to a grayscale, 1-bit depth image of 382 bytes.

4.3.1 Noise parameters

As discussed in Section 3.3.2, peak signal-to-noise ratio (PSNR) is a sound metric to

measure the effectiveness of the noise added to the characters. Below, we discuss

how varying the mean and variance of the noises affects PSNR and thus how

effectively data can be concealed in the clouds.

Gaussian noise

For Gaussian noise, our experiment varies the mean (μ) from -0.25 to 0.25 and the
variance (σ²) from 0 to 1 in increments of 0.05, and records the mean PSNR of all
the ASCII characters for each (μ, σ²). Since noise is added to all ASCII characters
and each has an equal probability of being queried, the lower the mean PSNR of the
characters, the more obfuscated the images become. Figure 4.1 plots noise variance
against mean PSNR and Figure 4.2 shows the slope of PSNR. The slope plotted in the
figure is the negative of the actual slope, so that the change in slope is easier
to see.

Figure 4.1 Mean PSNR of all ASCII characters for Gaussian noise.

Figure 4.2 Slope of mean PSNR of all ASCII characters for Gaussian noise.


It is evident from the figures that PSNR drops sharply before σ²=0.20 and becomes
asymptotic at σ²=0.35 and beyond for all values of μ. Thus, σ²=0.20 to
0.35 represents an optimal range for noise addition.
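The PSNR measurement described above can be sketched as follows. This is a minimal illustration and not the MATLAB and ImageMagick pipeline used in this work: it assumes pixel values normalised to [0, 1], that the noisy value is clipped and thresholded at 0.5 to return to a 1-bit image (the threshold is an assumption), and the standard definition PSNR = 10 log10(MAX^2 / MSE). The function names and the toy glyph are hypothetical, so the values it prints are not expected to match Figure 4.1.

```python
import math
import random


def add_gaussian_noise(image, mean, variance, rng):
    """Return a noisy 1-bit copy of `image` (0 = black, 1 = white).

    Assumption: the noisy value is clipped to [0, 1] and thresholded at 0.5
    to get back to a 1-bit image; the threshold is not taken from the thesis.
    """
    sigma = math.sqrt(variance)
    noisy = []
    for row in image:
        noisy_row = []
        for pixel in row:
            value = min(1.0, max(0.0, pixel + rng.gauss(mean, sigma)))
            noisy_row.append(1 if value >= 0.5 else 0)
        noisy.append(noisy_row)
    return noisy


def psnr(original, noisy, max_value=1.0):
    """Peak signal-to-noise ratio in dB (infinite for identical images)."""
    pairs = [(a, b) for ro, rn in zip(original, noisy) for a, b in zip(ro, rn)]
    mse = sum((a - b) ** 2 for a, b in pairs) / len(pairs)
    return math.inf if mse == 0 else 10.0 * math.log10(max_value ** 2 / mse)


def toy_glyph(size=40):
    """Hypothetical stand-in for a character image: a black bar on white."""
    return [[0 if size // 4 <= col < size // 2 else 1 for col in range(size)]
            for _ in range(size)]


if __name__ == "__main__":
    rng = random.Random(0)
    glyph = toy_glyph()
    for variance in (0.05, 0.20, 0.35):
        noisy = add_gaussian_noise(glyph, 0.0, variance, rng)
        print(f"variance={variance:.2f}  PSNR={psnr(glyph, noisy):.2f} dB")
```

A lower PSNR means the noisy image differs more from the original, which is why the experiments look for the variance beyond which PSNR stops falling.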

Following our discussion in Section 3.3.2, we further discuss the skewness for
Gaussian noise here. Taking the PSNR values into account, we can see that negative
and positive means distort the distribution unevenly. Although it is desirable to
choose noise parameters that give the lowest PSNR, a negative mean leads to a
noisier background and only a moderately noisy text. In our algorithm, the
background is effectively filtered out by employing the mask. Thus, as long as the
PSNR of an image is low enough to render the text noisy enough for effective
obfuscation, the noise in the background is not relevant. For a positive mean,
although the PSNR is higher than for a negative mean, the text portion is noisier.
Hence, positive and negative skewness each have advantages and disadvantages. For
further experiments, we restrict the system to μ=-0.10 to 0.10 to avoid generating
noise from a strongly asymmetric distribution. Figure 4.3 shows some images at
different (μ, σ²) with their individual PSNR values.

(μ, σ²)          PSNR (dB)
(-0.20, 0.30)    5.2560
(-0.10, 0.30)    5.7573
(0, 0.30)        7.0997
(0.10, 0.30)     7.7924
(0.20, 0.30)     8.4239
(0, 0.50)        5.9666

Figure 4.3 Images at different (μ, σ²) with their PSNR values in dB.

Speckle noise

As noted in Section 3.3.2, speckle noise is essentially a multiplicative noise with
μ=0, while the variance can be varied. Speckle noise only adds black pixels to the
image, on both the background and the text. We vary σ² from 0 to 5 in increments of
0.05. Figure 4.4 plots the mean PSNR of all characters against the variance and
Figure 4.5 plots the negative of the slope.

Figure 4.4 Mean PSNR of all ASCII characters for speckle noise.

Figure 4.5 Slope of mean PSNR of all ASCII characters for speckle noise.


The figures indicate that the mean PSNR stabilizes at σ²=1.85 to 2.00 and is
asymptotic beyond this point. Increasing the variance beyond 2.00 does not yield a
significant reduction in PSNR, as is evident from the slope; thus, we limit our
system to the range determined above.
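The multiplicative behaviour of speckle noise can be illustrated with a short sketch. It assumes the model J = I + n·I documented for MATLAB's imnoise speckle option, with n uniformly distributed, zero mean and the given variance; the clipping to [0, 1] and the 0.5 threshold used to return to a 1-bit image are assumptions, and the function name is hypothetical.

```python
import math
import random


def add_speckle_noise(image, variance, rng):
    """Multiplicative noise: J = I + n * I, with n uniform and zero mean.

    Assumptions: n is uniform on [-a, a] with a = sqrt(3 * variance), so that
    its variance equals `variance`; the result is clipped to [0, 1] and
    thresholded at 0.5 to return to a 1-bit image.
    """
    half_width = math.sqrt(3.0 * variance)
    noisy = []
    for row in image:
        noisy_row = []
        for pixel in row:
            n = rng.uniform(-half_width, half_width)
            value = min(1.0, max(0.0, pixel + n * pixel))
            noisy_row.append(1 if value >= 0.5 else 0)
        noisy.append(noisy_row)
    return noisy


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical 4x8 image: left half black (0), right half white (1).
    image = [[0] * 4 + [1] * 4 for _ in range(4)]
    for row in add_speckle_noise(image, 2.0, rng):
        print("".join(str(p) for p in row))
```

Because a black pixel has value 0, the product n·I is also 0 and the pixel is left unchanged; only white pixels can be turned black, which is consistent with the observation that speckle noise only adds black pixels.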

4.3.2 Sending and retrieving data

Based on the findings in the previous section, we conducted experiments on sending
and retrieving actual data on the cloud system. The data consists of a book chapter
with 784 characters (including white spaces). As discussed in Chapter 3, we want to
focus on retrieving records that begin with a certain string. In our test, we use
the search query "the", as it is the most frequently occurring word in the English
language [44]. The size of a record is set to 3056 bytes, which means that 8 images
compose a record, as one image is 382 bytes. Following the discussion in Section
4.1, the number of cloud service providers is 4.

Gaussian noise

Table 4.1 summarizes the ncc observed for Gaussian noise at different values of μ
and σ² for the search query "the". The numbers 43, 61 and 91 refer to the locations
of the records that begin with "the".

Table 4.1 shows that a negative μ indeed causes a negative skewness, leading a
large number of pixel values to be low, that is 0 (black). When the mask is created
with such a noisy image, the background will have more noise with black pixels,
while the black text will have less noise. The correlation of such a mask with the
original image will lead to a high ncc, as indicated in Section 3.4.2. A positive μ
leads to a positive skewness, so the majority of the values are relatively higher,
that is 1 (white). The text in the mask has a large number of white pixels as
noise. Thus, ncc for such an image is low, as correlation becomes difficult with
such noisy text. For both positive and negative μ, ncc drops steadily with
increasing σ². This indicates that successful data retrieval becomes increasingly
difficult with increasing variance.

Mean (μ) = -0.10
          σ² = 0.20               σ² = 0.25               σ² = 0.30               σ² = 0.35
record    43      61      91      43      61      91      43      61      91      43      61      91
t         0.9618  0.9502  0.9657  0.9541  0.9502  0.9463  0.9463  0.9345  0.9463  0.9306  0.9306  0.9227
h         0.9596  0.9468  0.9698  0.9494  0.9210  0.9468  0.9443  0.9132  0.9364  0.9417  0.8843  0.9106
e         0.9715  0.9483  0.9715  0.9570  0.9278  0.9570  0.9454  0.9130  0.9454  0.9366  0.8981  0.9249

Mean (μ) = -0.05
          σ² = 0.20               σ² = 0.25               σ² = 0.30               σ² = 0.35
record    43      61      91      43      61      91      43      61      91      43      61      91
t         0.9618  0.9520  0.9541  0.9502  0.9385  0.9463  0.9345  0.9306  0.9227  0.9266  0.9187  0.9227
h         0.9571  0.9443  0.9647  0.9443  0.9158  0.9391  0.9417  0.9001  0.9106  0.9236  0.8790  0.9054
e         0.9570  0.9308  0.9570  0.9454  0.9160  0.9454  0.9366  0.8981  0.9308  0.9219  0.8830  0.9071

Mean (μ) = 0.00
          σ² = 0.20               σ² = 0.25               σ² = 0.30               σ² = 0.35
record    43      61      91      43      61      91      43      61      91      43      61      91
t         0.9541  0.9502  0.9502  0.9385  0.9345  0.9345  0.9266  0.9187  0.9227  0.9026  0.9187  0.9187
h         0.9494  0.9210  0.9443  0.9417  0.9080  0.9262  0.9314  0.8790  0.9054  0.9027  0.8629  0.8922
e         0.9512  0.9160  0.9541  0.9425  0.9011  0.9366  0.9308  0.8830  0.9130  0.9190  0.8739  0.8951

Mean (μ) = 0.05
          σ² = 0.20               σ² = 0.25               σ² = 0.30               σ² = 0.35
record    43      61      91      43      61      91      43      61      91      43      61      91
t         0.9463  0.9345  0.9463  0.9266  0.9306  0.9227  0.9067  0.9187  0.9187  0.8905  0.9147  0.9147
h         0.9443  0.9132  0.9314  0.9314  0.8843  0.9054  0.9054  0.8683  0.8922  0.8922  0.8521  0.8736
e         0.9454  0.9100  0.9454  0.9337  0.8830  0.9190  0.9190  0.8739  0.8951  0.9041  0.8678  0.8708

Mean (μ) = 0.10
          σ² = 0.20               σ² = 0.25               σ² = 0.30               σ² = 0.35
record    43      61      91      43      61      91      43      61      91      43      61      91
t         0.9187  0.9187  0.9187  0.9345  0.9345  0.9345  0.8905  0.9187  0.9147  0.8823  0.9067  0.9147
h         0.9158  0.8736  0.8922  0.9417  0.9001  0.9106  0.8922  0.8548  0.8710  0.8869  0.8303  0.8656
e         0.9219  0.8739  0.8981  0.9366  0.8951  0.9308  0.9011  0.8678  0.8708  0.8860  0.8585  0.8616

Table 4.1 NCC for Gaussian noise.

Speckle noise

As the mean is 0 for speckle noise, we only need to vary the variance in this case.
Table 4.2 summarizes the observations.

Mean (μ) = 0.00
          σ² = 1.85               σ² = 1.90               σ² = 1.95               σ² = 2.00
record    43      61      91      43      61      91      43      61      91      43      61      91
t         1.0000  1.0000  1.0000  1.0000  1.0000  1.0000  1.0000  1.0000  1.0000  1.0000  1.0000  1.0000
h         1.0000  1.0000  1.0000  1.0000  1.0000  1.0000  1.0000  1.0000  1.0000  1.0000  1.0000  1.0000
e         1.0000  1.0000  1.0000  1.0000  1.0000  1.0000  1.0000  1.0000  1.0000  1.0000  1.0000  1.0000

Table 4.2 NCC for speckle noise.

The ncc is observed to be exactly 1 in all cases. This is owing to the nature of
the noise. Speckle noise is multiplicative, so a large amount of black noise is
added to the white background, since it is bright, while the text only has black
noise added to it and is therefore unaffected. However, if the variance is high,
such as 2.00, the noise added to the white background fills the image with so much
black noise that the text itself is obfuscated in the process, leading to a very
low PSNR, as noted in Section 4.3.1.

4.3.3 NCC threshold

When retrieving data from the clouds, NCC plays a key role in a successful
retrieval. A low ncc threshold might lead to a large number of false positives,
that is, records that do not actually begin with the search query will be
incorrectly identified as satisfying it. A high ncc threshold might cause
unsuccessful searches, that is, if the observed ncc is smaller than the threshold
set, records that do satisfy the search query will be ignored. Thus, it is a
non-trivial task to identify the correct ncc threshold thv.
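As a point of reference, the comparison against thv can be sketched as below. The sketch assumes the common zero-mean form of normalized cross-correlation; the exact definition used in this work is given in Chapter 3 and computed with ImageMagick, so this is only an illustration, and the function names are hypothetical.

```python
import math


def ncc(image_a, image_b):
    """Zero-mean normalized cross-correlation of two equal-sized images."""
    xs = [p for row in image_a for p in row]
    ys = [p for row in image_b for p in row]
    if len(xs) != len(ys) or not xs:
        raise ValueError("images must be non-empty and of equal size")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = math.sqrt(sum((x - mean_x) ** 2 for x in xs)
                    * sum((y - mean_y) ** 2 for y in ys))
    # A constant image has no variance; treat it as uncorrelated.
    return num / den if den else 0.0


def accept(ncc_value, thv):
    """Keep a candidate character only if its ncc reaches the threshold."""
    return ncc_value >= thv
```

With this form, identical images give an ncc of 1 and inverted images give -1, so raising thv trades false positives for missed records.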

Gaussian noise

Using the same system settings as in Section 4.3.2, the results of thv estimation
for Gaussian noise are presented in Table 4.3. Each cell gives the number of false
positives at a given thv for a range of (μ, σ²) values. Cells with 0 indicate no
false positives, which is the threshold we are looking for. Cells with -1 indicate
unsuccessful searches, that is, the search query was not found. This occurs when
the ncc of any character in the search string is less than the threshold,
following which that record is overlooked. Note that the -1 cases are more
detrimental to the system, as a low number of false positives is more acceptable
in real systems than missing the data altogether.
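The rule for an unsuccessful search can be checked against the measured values. The sketch below copies the ncc values from Table 4.1 for a mean of 0.10 and a variance of 0.35 and applies the rule that a record is overlooked as soon as one character falls below thv. The function names are hypothetical, and the sketch covers only the three true records, not the false positives.

```python
# ncc per character of the query "the" at records 43, 61 and 91,
# copied from Table 4.1 (mean 0.10, variance 0.35).
NCC_BY_RECORD = {
    43: {"t": 0.8823, "h": 0.8869, "e": 0.8860},
    61: {"t": 0.9067, "h": 0.8303, "e": 0.8585},
    91: {"t": 0.9147, "h": 0.8656, "e": 0.8616},
}


def record_matches(char_nccs, thv):
    """A record matches only if every query character reaches thv."""
    return all(value >= thv for value in char_nccs)


def missed_records(table, thv):
    """Records that truly contain the query but are overlooked at thv."""
    return [record for record, scores in sorted(table.items())
            if not record_matches(scores.values(), thv)]


def highest_safe_threshold(table):
    """Largest thv that still retrieves every true record."""
    return min(value for scores in table.values() for value in scores.values())


if __name__ == "__main__":
    for thv in (0.825, 0.85, 0.875):
        print(thv, missed_records(NCC_BY_RECORD, thv))
```

At thv=0.825 no record is missed, while at thv=0.85 record 61 is lost because the ncc of "h" is 0.8303. This agrees with the corresponding column of Table 4.3, which first shows -1 at thv=0.85.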

Figure 4.6 presents the key findings from the above observations. As expected, the
number of false positives is high for low threshold values. It is interesting to
note that the average number of false positives across all four variances falls by
38.51% from μ=-0.10 to μ=0.10 at a low thv of 0.7. At a high thv of 0.825, the fall
is 93.75%. At thv=0.85, ten contiguous observations with zero false positives are
observed from (0.00, 0.25) to (0.10, 0.30). This is a good indicator, since working
with a range of (μ, σ²) allows flexibility in setting up a large-scale system that
caters to different search queries. At thv=0.875, we observe unsuccessful searches
at (0.05, 0.35) and beyond. Thus, the block from (0.00, 0.25) to (0.05, 0.30)
allows a range of thv from 0.85 to 0.875, which is again useful for a large system
setup.

False positives
         μ = -0.10            μ = -0.05            μ = 0.00             μ = 0.05             μ = 0.10
thv  σ²: 0.20 0.25 0.30 0.35  0.20 0.25 0.30 0.35  0.20 0.25 0.30 0.35  0.20 0.25 0.30 0.35  0.20 0.25 0.30 0.35
0.7000   45   44   43   42    39   39   39   38    38   37   35   34    31   29   29   28    29   28   25   25
0.7250   38   37   36   35    33   31   30   29    26   24   26   25    24   23   21   20    23   20   20   18
0.7500   29   28   28   27    26   24   25   21    24   21   19   17    19   18   15   12    15   13    9    7
0.7750   24   23   23   21    19   18   13   11    15   14    1    8    12   10    6    6     6    6    4    4
0.8000   14   13   10    8    13   12    7    4     7    4    4    3     5    3    3    2     2    2    1    0
0.8250    5    4    4    3     4    2    2    2     3    2    2    1     2    2    1    0     1    0    0    0
0.8500    2    2    2    0     2    2    1    0     2    0    0    0     0    0    0    0     0    0    0   -1
0.8750    1    1    0    0     1    1    0    0     1    0    0    0     0    0    0   -1    -1   -1   -1   -1
0.9000    1    0    0   -1     0    0   -1   -1     0    0   -1   -1     0   -1   -1   -1    -1   -1   -1   -1
0.9250    0   -1   -1   -1     0   -1   -1   -1    -1   -1   -1   -1    -1   -1   -1   -1    -1   -1   -1   -1
0.9500   -1   -1   -1   -1    -1   -1   -1   -1    -1   -1   -1   -1    -1   -1   -1   -1    -1   -1   -1   -1

Table 4.3 NCC threshold estimation for Gaussian noise.

Figure 4.6 Optimum NCC threshold for Gaussian noise.

The parameter (0.05, 0.20) provides a good setting, as it lies almost midway
between (0.00, 0.25) and (0.05, 0.30) and, more importantly, thv can go up to 0.9
without leading to false positives or misses. Thus, thv=0.875 is a good parameter,
as it is supported by zero false positives above and below it. Hence, we can
convincingly set the system to μ=0.05, σ²=0.20 and thv=0.875 for further
experiments with Gaussian noise.

Speckle noise

The results for speckle noise are presented in Table 4.4. Owing to the nature of
speckle noise, there is no unsuccessful search at any threshold. Moreover, the
false positives cease to exist when thv=0.95. Figure 4.7 shows the results at
different variances.

False positives
thv      σ² = 1.85  σ² = 1.90  σ² = 1.95  σ² = 2.00
0.7000   63         63         64         63
0.7250   52         53         53         52
0.7500   43         42         42         42
0.7750   39         38         39         40
0.8000   36         35         36         36
0.8250   25         24         25         25
0.8500   14         13         14         14
0.8750   6          7          6          7
0.9000   4          4          4          4
0.9250   2          2          2          2
0.9500   0          0          0          0
0.9750   0          0          0          0
1.0000   0          0          0          0

Table 4.4 NCC threshold estimation for speckle noise.

Figure 4.7 Optimum NCC threshold for speckle noise.

It is noteworthy that the number of false positives is almost constant despite an
increasing variance. This again is attributed to the multiplicative nature of the
noise addition. Thus, even when the image is greatly distorted by noise, that is,
at high variance values, data retrieval is more effective with speckle than with
Gaussian noise. From these results we can convincingly assign thv=0.99 for speckle
noise.

Hence, we have now found the optimum parameters for both Gaussian and speckle
noise. Before we proceed to testing on a large dataset, it is interesting to
compare the effectiveness of these noises. As we observed, the optimum parameters
for Gaussian noise are μ=0.05 and σ²=0.20. From Section 4.3.1 and Figures 4.1 and
4.2, the mean PSNR for all ASCII characters at (0.05, 0.20) is 8.0723 dB. For
speckle, the optimum is at σ²=2.00, where the mean PSNR for all ASCII characters is
4.6580 dB. Thus, speckle gives a mean PSNR about 42% lower ((8.0723 - 4.6580) /
8.0723 ≈ 0.42), making it the more effective noise for our system. In the following
experiments, we evaluate both noises with their optimal parameters to compare their
performance on a larger scale.

4.4 Multiple queries on large dataset

Based on the above findings, we now test our system on an independent dataset with
both Gaussian and speckle noise. We selected a book chapter to be sent to the
cloud. The system has the following parameters:

number of characters in the data = 5370
size of each record, rs = 3056 bytes (8 images per record, 382 bytes each)
total number of records, nor = 683
number of cloud service providers simulated, ns = 4
Gaussian noise: μ = 0.05 and σ² = 0.20
Speckle noise: σ² = 2.00 (μ = 0 for speckle noise)
thv: Gaussian noise = 0.8750 and speckle noise = 0.99

Table 4.5 summarizes the results for both noises:


                                 Gaussian noise              Speckle noise
search query    number of       false      unsuccessful     false      unsuccessful
                occurrences     positives  search           positives  search
an              11              1          0                0          0
ing             6               0          0                0          0
of              4               0          0                0          0
as              6               0          0                0          0
s.              1               0          0                0          0
and             1               1          0                0          0
the             9               0          0                0          0
all             4               1          0                0          0
995             1               0          0                0          0
We              1               0          0                0          0
we              4               0          0                0          0
because         1               0          0                0          0
warming         1               0          0                0          0
green           2               0          0                0          0
Earth           2               0          0                0          0
have            2               0          0                0          0
know            1               0          0                0          0
on              3               0          0                0          0
he              10              1          0                0          0
me              2               1          0                0          0

Table 4.5 Data retrieval with multiple search queries on a 5K character dataset.

The results indicate that the system successfully retrieved all the queries with no
misses. There were no false positives for speckle noise, while Gaussian noise
produced only five. The system could also distinguish between lowercase and
uppercase forms of the same letter, and recognized the space character and digits.
Where a query occurred multiple times, all occurrences were successfully retrieved.

On investigating the false positives, we find that of the five for Gaussian noise
(speckle has zero), three are due to errors in distinguishing lowercase from
uppercase characters. For instance, for the query "all", the one false positive is
due to the system recognizing "All" as "all". The system retrieves long search
queries, symbols, and uppercase and lowercase versions of the same character with
100% retrieval accuracy. The total number of search queries, counting every
occurrence, is 72 and the number of false positives for Gaussian noise is 5, which
accounts for 6.94% of all queries. Speckle noise has zero false positives.
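The totals quoted above follow directly from Table 4.5. A small sketch of the bookkeeping, with the rows copied from the table and hypothetical names:

```python
# (query, occurrences, Gaussian false positives, speckle false positives),
# copied from Table 4.5.
TABLE_4_5 = [
    ("an", 11, 1, 0), ("ing", 6, 0, 0), ("of", 4, 0, 0), ("as", 6, 0, 0),
    ("s.", 1, 0, 0), ("and", 1, 1, 0), ("the", 9, 0, 0), ("all", 4, 1, 0),
    ("995", 1, 0, 0), ("We", 1, 0, 0), ("we", 4, 0, 0), ("because", 1, 0, 0),
    ("warming", 1, 0, 0), ("green", 2, 0, 0), ("Earth", 2, 0, 0),
    ("have", 2, 0, 0), ("know", 1, 0, 0), ("on", 3, 0, 0), ("he", 10, 1, 0),
    ("me", 2, 1, 0),
]


def summarize(rows):
    """Return total occurrences, false positives and false-positive rates."""
    total = sum(row[1] for row in rows)
    gaussian_fp = sum(row[2] for row in rows)
    speckle_fp = sum(row[3] for row in rows)
    return {
        "total": total,
        "gaussian_fp": gaussian_fp,
        "speckle_fp": speckle_fp,
        "gaussian_rate_percent": 100.0 * gaussian_fp / total,
        "speckle_rate_percent": 100.0 * speckle_fp / total,
    }


if __name__ == "__main__":
    print(summarize(TABLE_4_5))
```

Note that the rate is taken relative to the number of true occurrences, not to the number of records searched.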

As the last series of experiments, we tested the system with longer search queries
and a larger dataset to reflect a more realistic scenario. We doubled the record
size to test longer queries. The remaining parameters are the same as in the
previous experiments. The system parameters are as follows:

number of cloud service providers simulated, ns = 4

The findings are summarized in Table 4.6. The system consistently has zero
unsuccessful searches. The number of false positives is small for both noises.

                                 Gaussian noise              Speckle noise
search query        number of    false      unsuccessful     false      unsuccessful
                    occurrences  positives  search           positives  search
computer            4            0          0                0          0
programming         1            0          0                0          0
basic               4            0          0                0          0
on set B (say)      1            0          0                0          0
numbers:            1            0          0                0          0
simultaneously      1            0          0                0          0
comp                8            0          0                0          0
comm                1            0          0                0          0
to                  9            3          0                2          0
too                 1            0          0                0          0
machine             1            0          0                0          0
from the wheels     1            0          0                0          0
in fact             2            1          0                1          0
element             2            0          0                0          0
alphabets.          1            0          0                0          0
"                   8            0          0                0          0
"hardware"          1            0          0                0          0
operation           2            0          0                0          0
multi               2            0          0                0          0
Cipher, with A      1            0          0                0          0
For                 1            0          0                0          0
for                 3            0          0                0          0
That is, why        1            0          0                0          0
Also, our           1            0          0                0          0
Let us examine      1            0          0                0          0
R.L. Brown          1            0          0                0          0
quite a variety!    1            0          0                0          0
buses - to          1            0          0                0          0
It's a deep ques    1            0          0                0          0
what?               1            0          0                0          0

Table 4.6 Data retrieval with multiple search queries on a 10K character dataset.

Table 4.6 shows the test results for a dataset twice the size of the former, with
64 queries that are much more complex than the previous ones. The number of false
positives for Gaussian noise is 4, that is 6.25% of the total queries, and speckle
noise has 3, that is 4.69%. The final results are presented in Figure 4.8.

Figure 4.8 Data retrieval on >5K and >10K character dataset with multiple search queries.

Hence, we conclude that both Gaussian and speckle noise can be used for data
obfuscation in our system with 100% retrieval accuracy. The number of false
positives is very low and does not affect the system accuracy. Speckle noise
performs slightly better than Gaussian noise.

4.5 Running time

Before we proceed, we discuss the running time of the above experiments. All
results were generated on a 64-bit, 3.4 GHz Intel Core i7 2600 processor with 8 GB
RAM. At peak usage, processor and memory consumption were only about 20% each.

The average running time for a query was determined from observations on the >5K
and >10K character datasets. To give an accurate picture, we report time per unit
character, that is, the time required to send one character to, or retrieve one
character from, the cloud. The sending time depends on the number of characters in
the text. Thus, we report the total sending time divided by the number of
characters as the sending time. For retrieval, time depends primarily on the number
of records, nor. The query length also affects the retrieval time, but we chose
queries of varying length for our experiments, so an average retrieval time
suffices. Thus, the retrieval time is the total retrieval time divided by the
number of records.

Note that we are working with images; thus, time per character means the time taken
to send the image of a character to, and retrieve it from, the cloud system. Recall
that each image is 382 bytes. In addition, the figures were averaged over both
noises, as their running times were within 0.1 ms of each other. Results are
presented in Table 4.7. We show results for different cloud configurations, varying
the number of clouds, ns, and the record size, rs. All figures are in milliseconds.

ns    rs (images per record)    Sending time (in ms)    Retrieval time (in ms)
4     8                         22.21                   28.53
4     16                        19.40                   14.08
8     8                         34.17                   23.29
8     16                        30.92                   11.75

Table 4.7 Running time measurements.

The running times are reasonably fast. It is interesting to notice that as ns is
doubled for the same rs, sending time increases by about 54% (from 22.21 ms to
34.17 ms at rs = 8). However, for the same case, retrieval time drops by about 18%.
The drop can be attributed to the fact that it is computationally less expensive to
withdraw small parts of a fixed-size object than large chunks. When ns doubles, the
number of parts of an image also doubles and the data held by each cloud halves.
During retrieval, when each cloud is asked to transmit its part of the image, a
cloud sends images that are half the size of those sent when ns is not doubled.
Thus, it is beneficial to have more clouds not only from a security perspective but
also from a computation point of view.

If rs is doubled while keeping ns the same, sending time drops by about 12.5% (at
ns = 4). The reason is that writing to contiguous locations in a record is better
than writing data to contiguous records. With a higher rs, the system has less
overhead in going from one record to the next. On the other hand, retrieval time
drops by about 50% as rs doubles while ns is constant. This is also because
traversing a single record and reading the data is quicker than going from one
record to the next. We reckon all of this is attributable to caching, since we are
reading from contiguous locations in a record. Thus, it is beneficial to have longer
records from a computation perspective.
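The relative changes discussed above can be recomputed from Table 4.7. In the sketch below, the timings are copied from the table and each change is expressed relative to the starting configuration; a different baseline gives a different percentage. The names are hypothetical.

```python
# (ns, images per record) -> (sending ms, retrieval ms), copied from Table 4.7.
TIMES = {
    (4, 8): (22.21, 28.53),
    (4, 16): (19.40, 14.08),
    (8, 8): (34.17, 23.29),
    (8, 16): (30.92, 11.75),
}


def percent_change(before, after):
    """Signed change from `before` to `after`, as a percentage of `before`."""
    return 100.0 * (after - before) / before


def compare(base, other):
    """Percent change in (sending, retrieval) going from `base` to `other`."""
    return tuple(percent_change(b, o) for b, o in zip(TIMES[base], TIMES[other]))


if __name__ == "__main__":
    pairs = [((4, 8), (8, 8)), ((4, 16), (8, 16)),   # doubling ns
             ((4, 8), (4, 16)), ((8, 8), (8, 16))]   # doubling rs
    for base, other in pairs:
        send, retrieve = compare(base, other)
        print(f"{base} -> {other}: sending {send:+.1f}%, retrieval {retrieve:+.1f}%")
```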


However, from a security perspective it is not advisable to have long records. If
the record length is small, there will be more records for a given input text. When
an attacker eavesdrops and tries to determine from which record the query is being
retrieved, the fewer the records, the smaller the attack domain for the attacker.
We want to generate as many false positives as possible for an attacker, which can
only occur when the number of records is high, that is, when the record length is
small.

Note that the program is not written to exploit a multi-threaded architecture; we
use just one thread at a time. If the code is made multi-threaded, the running time
should come down by a large margin. Parallelizing the image processing would also
make the system much faster. Calculating the NCC is a time-consuming task, and we
would like to investigate how to speed up this operation. We strongly believe this
is possible, as we currently use a very basic way of calculating the NCC that is
not optimized for speed.

Also, we are not using any indexing services on the database. Simple hashing

is used for identifying the records. Simple hashing is easy to implement but not an

optimal choice. We believe a good hash function can reduce the running time and

also add security to the system.

We have not yet discussed how the system will behave in case of an attack. If one
or more servers are compromised, how much data will an adversary be able to
retrieve? In the next chapter we discuss these circumstances.

Chapter 5

Threat Analysis

In this chapter, we discuss how the system reacts in the event of an unauthorized

access. Various scenarios are developed and extensively tested to determine the

system resilience to attacks.

5.1 Threat scenarios

Our system achieves data confidentiality by storing an authorized user's data in
the form of images at independent cloud providers. The providers are not aware of
each other's presence. Also, the data stored in them is disjoint; thus, no two
clouds have any data in common. Only authorized users can access the cloud for
adding or retrieving data.

When an unauthorized user tries to access the data and targets a specific cloud
provider, only a part of the data will be accessible to him. That is, for ns clouds,
only a 1/ns part of the data is revealed to the attacker. In such a scenario, when
an attacker sends a search_query, only the cloud which has been compromised by the
attacker will return the actual data, while the rest of the providers will return
noisy data.

As an example for our system, consider ns=4, and assume the authorised user has
already stored the data in the clouds. From the discussion at the end of Section
3.3, note that the user has to connect to each of the clouds, and the system
establishes independent channels of communication between the user and the
providers. Consider an attacker who is able to gain access to one of the clouds and
wants to retrieve all the data stored on the cloud by a user. When he sends a
search_query, only the cloud which has been compromised will return the actual
data, while the others will simply return noise. Figure 5.1 and Figure 5.2 show how
the data, for instance the character a, will appear to an attacker when one and two
clouds are breached for ns=4.
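The attacker's view can be mimicked with a toy model. The sketch below is purely illustrative: the actual partitioning of an image among the clouds is defined in Chapter 3 and is not reproduced here, so equal horizontal strips are used as a placeholder, and the parts held by clouds that were not breached are replaced by random pixels. All names are hypothetical.

```python
import random


def split_into_parts(image, ns):
    """Split the rows into ns disjoint, contiguous strips (placeholder scheme)."""
    rows = len(image)
    bounds = [rows * i // ns for i in range(ns + 1)]
    return [image[bounds[i]:bounds[i + 1]] for i in range(ns)]


def attacker_view(image, ns, breached, rng):
    """Rebuild the image as seen when only the `breached` clouds are readable.

    `breached` is a set of zero-based cloud indices. Parts held by other
    clouds are replaced by random black/white pixels.
    """
    view = []
    for index, part in enumerate(split_into_parts(image, ns)):
        if index in breached:
            view.extend(row[:] for row in part)
        else:
            view.extend([rng.randint(0, 1) for _ in row] for row in part)
    return view


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical 8x8 glyph: a black vertical bar on a white background.
    glyph = [[0 if 3 <= col <= 4 else 1 for col in range(8)] for _ in range(8)]
    for row in attacker_view(glyph, 4, {0}, rng):
        print("".join(str(p) for p in row))
```

Only the first quarter of the rows is recovered in this run; the remaining three quarters carry no information about the stored character.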

Figure 5.1 Data when one cloud out of four is breached. (Panels: data available to an authorised user; data available to an attacker for ns=4 when cloud 1, 2, 3 or 4 is breached.)

Figure 5.2 Data when two clouds out of four are breached. (Panels: data available to an authorised user; data available to an attacker for ns=4 when clouds 1 and 2, 1 and 3, 1 and 4, 2 and 3, 2 and 4, or 3 and 4 are breached.)

We suppose that the attacker is not aware of which cloud holds which part of the
data. Thus, if he plans to attack one cloud, each cloud has an equal probability of
holding any of the one-fourth parts. In other words, the probability of an attacker
breaching a particular cloud i, for a total of ns clouds, is 1/ns. For four clouds,
if one cloud is to be breached, the attacker has four equally probable options:

{1}, {2}, {3}, {4},

while if two clouds are to be breached, there are six equally likely scenarios:

{1, 2}, {1, 3}, {1, 4}, {2, 3}, {2, 4}, {3, 4}.
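The equally likely breach scenarios listed above are simply the subsets of the clouds of a given size. A short sketch, with a hypothetical function name, that enumerates them:

```python
from itertools import combinations


def breach_scenarios(ns, breached):
    """All ways of choosing `breached` clouds out of clouds 1..ns."""
    return list(combinations(range(1, ns + 1), breached))


if __name__ == "__main__":
    for k in (1, 2):
        scenarios = breach_scenarios(4, k)
        print(f"{k} cloud(s) breached: {len(scenarios)} scenarios -> {scenarios}")
```

Each scenario has probability 1 / len(scenarios) when the attacker has no knowledge of which cloud holds which part.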

In Section 3.3.3, we briefly discussed how the number of cloud providers affects
the overall security of the system. We postulate that as the number of clouds
increases, the data will be further divided. Hence, in case of a breach, only a very
small fraction of the data will be leaked to the attacker. We also test our system
with ns=8 to validate our claim.

5.2 Experimental setup

To test the threat scenarios, we sent some data to the cloud for storage. As we saw
in Section 4.4, although both Gaussian and speckle noise performed well, Gaussian
noise performed slightly worse than speckle noise. Thus, we can assume that the
chance of breaching a cloud where data obfuscation is done using Gaussian noise is
higher. Hence, for the following experiments we use Gaussian noise for image
obfuscation. The system has the following parameters:

number of cloud service providers simulated, ns = 4 and ns = 8
Gaussian noise: μ = 0.05 and σ² = 0.20
thv: Gaussian noise = 0.8750

The parameters for Gaussian noise are the same as in Chapter 4. As a worst-case
scenario, we also assume that the attacker is aware of the optimum threshold to use
for correlation. We test with the following ten queries:

six an in for the

My grown-up wing to any one

In other words, we want to see whether the attacker can successfully retrieve
records for a certain input query by breaching a certain number of clouds. Table
5.1 indicates the number of cases for which we tested the system. For a specific
ns, the integers denote the number of clouds breached, and the brackets give the
fraction of the total data leaked to the attacker.

                                  ns = 4              ns = 8
clouds breached                   1        2          1         2        4
                                  (25%)    (50%)      (12.5%)   (25%)    (50%)
number of combinations tested     40       60         80        280      700

Table 5.1 Threat scenarios tested.
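The entries of Table 5.1 can be reproduced by counting the subsets of breached clouds and multiplying by the ten queries. A minimal sketch with hypothetical names:

```python
from math import comb

QUERIES = 10  # number of search queries tested


def combinations_tested(ns, breached, queries=QUERIES):
    """Number of (cloud subset, query) cases for one column of the table."""
    return comb(ns, breached) * queries


def fraction_leaked(ns, breached):
    """Fraction of the data exposed when `breached` of `ns` clouds are read."""
    return breached / ns


if __name__ == "__main__":
    for ns, breached in [(4, 1), (4, 2), (8, 1), (8, 2), (8, 4)]:
        print(ns, breached, f"{fraction_leaked(ns, breached):.1%}",
              combinations_tested(ns, breached))
```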


5.3 Results

We measured the number of successful retrievals and false positives for a specific

search string. Results for all ten queries were averaged for a specific number of

clouds breached, that is, for each column in Table 5.1. The rational to average the

results is that since an attacker decided to breach one cloud, and total number of

clouds is four, all four combinations are equally likely. The attacker is not aware

of the total number of clouds into which the data is divided. Attacker only

decides the number of clouds he wants to breach.

Figure 5.3 and Figure 5.4 illustrate the results for ns = 4 and ns = 8,

respectively. Each observation point represents results for a single cloud

configuration for the ten queries tested. That is, for instance, with ns = 4 and one

cloud breached, all ten queries for this cloud configuration were tested. The false

positives and successful retrieval were recorded and converted into percentage. In

the figures, it represents one observation point each for the false positives and

successful retrieval. In other words, each dot in the graph represents ten query

results. For one cloud breached there are = 4 combinations, thus (10×4=) 40

recordings. Refer to Table 5.1 for other scenarios. Similar explanation holds for

other test scenarios as well.

Results are summarized in Table 5.2. Of the ten queries, some matched a single record while others matched several; in total, fifteen records must be retrieved for a 100% successful retrieval. The lower the attacker's success rate and the higher the false positive rate, the better the resilience of the system to an attack.


Figure 5.3 False positives and successful search with four clouds.

Figure 5.4 False positives and successful search with eight clouds.


ns                      4          4          8            8          8
clouds breached         1 (25%)    2 (50%)    1 (12.5%)    2 (25%)    4 (50%)
successful retrieval*   5.0        37.8       0.0          5.7        37.2
false positive*         25.0       18.9       2.5          5.5        15.1

* all figures in %.

Table 5.2 Results for different attack scenarios: average false positives and successful retrievals.

5.4 Analysis

The results are encouraging. Even with 25% of the data in the clouds breached, the success rate for an attacker is only about 5%. Even when 50% of the data is available to the attacker, he can still retrieve less than 40% of the matching records. The number of false positives is also moderate. For our system, we want the false positives to be high in case of an attack: a high number confuses the attacker and makes it more difficult to retrieve any useful information from the data. An attacker may use an artificial intelligence or machine learning technique to extract patterns from the queries and, in the process, try to estimate the data. A good number of false positives makes such an attempt increasingly difficult.


In real life, a 50% breach is too high a number. We believe an attacker

may be able to access about 10-20% of the data in a realistic scenario. Also, we

tested the system with just four and eight providers. For better resilience to

attacks, this number will be much higher in realistic cases. From Table 5.2 we can see that when only one server out of eight, that is 12.5% of the data, is breached, the success rate for an attacker is zero.

We can also observe that, for a given fraction of data breached, the successful retrieval rate is almost the same for ns = 4 and ns = 8 (5.0% versus 5.7% at 25%, and 37.8% versus 37.2% at 50%). This indicates the

consistency and robustness of our system. Given the large number of cloud

providers in the industry, the more providers we use, the lower the success rate for an attacker will be, with a favorable ratio to the number of false positives. Thus, our results support the theoretical security analysis in Section 3.6. Also, note that the attacker here can be an external agent or even internal to the service provider. In either case, data confidentiality is maintained: even the cloud provider itself cannot determine the data with full accuracy.
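To compare the two configurations at equal exposure, the entries of Table 5.2 can be grouped by the fraction of data breached. The sketch below does this in Python; the numbers are copied from Table 5.2, while the dictionary layout and the function name are only illustrative.

```python
# Values copied from Table 5.2:
# (ns, clouds breached) -> (successful retrieval %, false positive %)
table_5_2 = {
    (4, 1): (5.0, 25.0),
    (4, 2): (37.8, 18.9),
    (8, 1): (0.0, 2.5),
    (8, 2): (5.7, 5.5),
    (8, 4): (37.2, 15.1),
}


def by_fraction(table):
    """Group results by the fraction of data breached (clouds breached / ns)."""
    groups = {}
    for (ns, breached), rates in table.items():
        groups.setdefault(breached / ns, []).append((ns, *rates))
    return groups


groups = by_fraction(table_5_2)
for fraction in sorted(groups):
    for ns, success, false_pos in groups[fraction]:
        print(f"{fraction:.1%} breached, ns={ns}: "
              f"success={success}%, false positives={false_pos}%")
```

Grouped this way, the successful retrieval rates at 25% and at 50% exposure can be read side by side for ns = 4 and ns = 8.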

In our experiments, when the attacker has access to some of the clouds and tries to retrieve data from the entire system, the clouds to which the attacker does not have access return garbage or noisy data. These uncompromised clouds add to the output the same noise pattern that was used to obfuscate the original data before it was sent to the cloud. In theory, this should help the attacker when he creates the mask and attempts retrieval.

Also, the optimum ncc threshold, thv, which was determined as described

in Section 4.3.3, is provided to the attacker in our tests. In realistic cases, this is


only known to the user and even the cloud is not aware of it. Even when armed

with thv, the attacker is unable to fully retrieve the data. Our experimental setup

was a best-case scenario for an attacker, yet the results still favour strong data confidentiality. Moreover, the user can verify the integrity of the data at any time by evaluating random records to see whether they have been corrupted.

We would like to highlight that there are cases when an attacker is able to

retrieve parts of the data. The figures presented in the previous section are averages. Overall the data is safe from an attack; however, as revealed in Figures 5.3 and 5.4, there are instances when the attacker succeeds in retrieving the records. On closer inspection, the reason is how the data is split. When a 40×40 image is cropped into four or eight parts, the topmost and bottommost parts do not contain the text portion, while the parts from the middle of the image contain most of the text. Even when noise is added to these middle sections, an attacker may be able to reveal the data. The high success rate observed in some cases is due to the attacker having access to these middle parts of the image.

In a realistic scenario, which cloud holds which part of the data is secret. Thus, the attacker is almost equally likely to attack a cloud that holds the top or bottom parts of the image as one that holds the middle region. To address this, we propose a modified strategy.

When cropping an image, a certain percentage of the top and bottom

regions of the image should be split into a nominal number of parts, such as four.


Even if these top and bottom parts are revealed to an attacker, no meaningful data can be revealed, since they hardly contain any text. The middle region of the image should be split into more parts than the end regions, for instance eight or more. When these middle parts are sent to the cloud providers, each cloud holds only a very small part of the original image. The situation is then similar to the case in which only one out of eight parts is revealed to an attacker. Such a scheme will allow full data confidentiality. If a user is limited by the number of cloud providers, then two parts of the image instead of one can be sent to the same cloud, provided the two parts are as far apart as possible in the original image. This is analogous to the concept of Hamming distance used in visual cryptography [40].
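The modified strategy can be sketched as a plan of row bands for a 40-row image: coarse bands at the top and bottom, finer bands in the middle, and a round-robin placement when there are fewer clouds than bands. This is a minimal Python illustration under assumed parameters; the 25% end regions, the band counts and the round-robin heuristic are hypothetical choices, not values taken from our experiments.

```python
def split_rows(start, stop, parts):
    """Split the half-open row range [start, stop) into contiguous bands
    of near-equal height."""
    height = stop - start
    bounds = [start + (height * i) // parts for i in range(parts + 1)]
    return list(zip(bounds[:-1], bounds[1:]))


def crop_plan(rows=40, end_fraction=0.25, end_parts=2, middle_parts=8):
    """Coarse bands for the (mostly empty) top and bottom regions,
    finer bands for the text-bearing middle region.
    All parameter defaults are assumptions for illustration."""
    end = int(rows * end_fraction)
    top = split_rows(0, end, end_parts)
    middle = split_rows(end, rows - end, middle_parts)
    bottom = split_rows(rows - end, rows, end_parts)
    return top + middle + bottom


def assign_to_clouds(bands, clouds):
    """Place band i on cloud i % clouds, so that bands sharing a cloud are
    `clouds` positions apart in the image (a simple heuristic, not
    claimed to be optimal)."""
    placement = {c: [] for c in range(clouds)}
    for i, band in enumerate(bands):
        placement[i % clouds].append(band)
    return placement


bands = crop_plan()
placement = assign_to_clouds(bands, clouds=6)
for cloud, held in placement.items():
    print(cloud, held)
```

With these defaults the image is cut into twelve bands, and each of the six clouds holds two bands that are six positions apart.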


Chapter 6

Conclusions and Future Work

In this chapter we present the conclusion for our work and suggest improvements

to the existing framework.

6.1 Conclusions

We introduced a novel method to achieve data confidentiality in the cloud

computing environment. The cloud provider is considered untrustworthy: the data must be concealed not only from outside attackers but also from the provider itself, which must not be able to extract meaningful information from the data. Instead of

relying on one cloud service provider, we propose to use multiple (untrustworthy)

public cloud providers. We use visual cryptography to protect the data on the

cloud. Standard encryption schemes are avoided, yet we achieve strong privacy of

the data. A new visual cryptography scheme for binary images is introduced for

our system. The complexity of our approach is shown to be reasonable and much lower than that of standard encryption-based schemes. We also avoid the overhead associated with key management that encryption requires. Besides achieving full data confidentiality, a user can also run database-style queries on the system for efficient data retrieval, which is not possible with encryption-based schemes.


The system was tested for a simple query scheme on small and large

datasets. Results indicate that the system is able to retrieve data successfully with

zero failure rate and very few false positives. Threat analysis of the system with

worst-case security conditions indicates the system is resilient to attacks and data

leaks. Attacks not only from outside the cloud, but also from within the cloud will

not be successful. A cloud owner itself will be unaware of the contents of the data

even during data retrieval operations.

We believe our system is best suited for storing sensitive data such as

medical records and financial transactions. At present, cloud providers do not

allow storing encrypted data as search cannot be performed on it. Thus, current schemes cannot achieve both data confidentiality and efficient query execution. With our approach, data such as credit card information or a person's health record can be stored and queried on the cloud. At the expense of a small computational overhead, which is much less than that of encryption-based schemes, we achieve both query execution and data confidentiality.

6.2 Future work

The results are encouraging for our system and indicate that our work has the

potential for a large-scale application. To this end, we propose some key

improvements and suggestions for future systems.

To validate our proposed method, we conducted small-scale experiments

and restricted ourselves to a basic LIKE database query for retrieval operations.

Our system can be easily modified to include other basic database queries as well.

However, we believe that complex queries involving JOIN operations, while not impossible, will be difficult to incorporate into our system.


In addition, with a wider range of queries to work with, we expect to

test our system with a larger database and more query operations. We relied on a

simulated cloud environment for our experiments. The system must be tested on

real clouds to better evaluate the complexity and overhead of implementation. It

would also be interesting to see how the system behaves with different noise

models besides Gaussian and Speckle. These two noise models were deemed sufficient for a preliminary analysis; however, further investigation with other noise types is pending.

We also acknowledge that the running time of our system is only moderately good. We strongly believe that the system can be optimized from a programming perspective to increase its speed considerably. Multi-threading and parallelization should increase the performance manyfold. A fast approach to calculating the NCC should also be investigated, since we use a very basic version that is not optimized for speed. Also, indexing the database and using a better hash function will lower the retrieval time by a good margin. How these can be accommodated securely, such that the service provider cannot exploit them, must be investigated.
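For reference, a basic NCC computation of the kind mentioned above can be sketched in a few lines of Python. This is the standard zero-mean normalized cross-correlation of two equal-size patches, written with plain lists; it is an assumption that this matches the exact variant used in our implementation, and the function name is illustrative.

```python
from math import sqrt


def ncc(a, b):
    """Zero-mean normalized cross-correlation of two equal-size 2-D patches
    (lists of rows). Returns a value in [-1, 1]; 0.0 if either patch is flat."""
    xs = [v for row in a for v in row]
    ys = [v for row in b for v in row]
    if len(xs) != len(ys) or not xs:
        raise ValueError("patches must be non-empty and of equal size")
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sqrt(sum((x - mx) ** 2 for x in xs) * sum((y - my) ** 2 for y in ys))
    return num / den if den else 0.0


# A small binary patch compared with itself and with its inverse
patch = [[0, 1, 1], [1, 0, 1], [0, 0, 1]]
inverted = [[1 - v for v in row] for row in patch]
print(ncc(patch, patch), ncc(patch, inverted))
```

Every comparison in this form recomputes the means and sums from scratch, which is the cost that a faster formulation such as [34] aims to avoid.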

With respect to our visual cryptography technique, instead of a simple

horizontal crop operation, a pseudo-random approach can be investigated which

will add security to the system. However, the overhead involved in such a scheme

must be balanced such that performance does not suffer. We also expect to test

the system against artificial intelligence/machine learning based attacks where an

attacker may employ techniques to gain useful information from the data.
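One way to picture the pseudo-random alternative to a horizontal crop is to scatter the image rows over the shares using a seeded shuffle, so that no share holds a contiguous region. The Python sketch below is purely illustrative: the function names and the seed are hypothetical, and the standard `random` module is not cryptographically secure, so a real design would need a proper keyed generator.

```python
import random


def pseudo_random_split(rows, parts, seed):
    """Assign each row index to one of `parts` shares using a seeded shuffle.
    The same seed always reproduces the same assignment."""
    order = list(range(rows))
    random.Random(seed).shuffle(order)  # illustrative only, not a secure PRNG
    return [sorted(order[i::parts]) for i in range(parts)]


def reassemble(shares):
    """Recover the full, ordered list of row indices from all shares."""
    return sorted(r for share in shares for r in share)


shares = pseudo_random_split(rows=40, parts=8, seed=2011)
for i, share in enumerate(shares):
    print(i, share)
```

The user would need to keep the seed secret in order to rebuild the image, which is part of the overhead that has to be weighed against the added security.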


References

[1] M. Abdalla, M. Bellare, D. Catalano, E. Kiltz, T. Kohno, T. Lange, J. Malone-Lee, G. Neven, P. Paillier, and H. Shi, "Searchable encryption revisited: Consistency properties, relation to anonymous IBE, and extensions," CRYPTO 2005. LNCS vol. 3621, pp. 205–222. Springer, Heidelberg (2005).

[2] Amazon, "Amazon Web Services: Overview of Security Processes", May 2011.

http://awsmedia.s3.amazonaws.com/pdf/AWS_Security_Whitepaper.pdf. [Accessed: June 14, 2011.]

[3] Amazon S3, "Amazon Simple Storage Service FAQs". http://aws.amazon.com/s3/faqs/#How_secure_is_my_data. [Accessed: Feb 20, 2011.]

[4] M. Armbrust, A. Fox, R. Griffith, A. Joseph, R. Katz, A. Konwinski, G. Lee, D. Patterson, A. Rabkin, I. Stoica, and M. Zaharia, "Above the Clouds: A Berkeley View of Cloud Computing," Technical Report No. UCB/EECS-2009-28, University of California at Berkeley, USA, Feb. 10, 2009.

[5] S. Bajaj and R. Sion, "TrustedDB: a trusted hardware based database with

privacy and data confidentiality," in Proceedings of the 2011 international conference on Management of data (SIGMOD '11), ACM, New York, NY, USA, pp. 205-216, 2011. [doi=10.1145/1989323.1989346]

[6] L. Ballard, S. Kamara, and F. Monrose, "Achieving efficient conjunctive

keyword searches over encrypted data," in Proceedings of the Seventh International Conference on Information and Communication Security (ICICS '05), pp. 414-426, 2005.


[7] B. Barak, O. Goldreich, R. Impagliazzo, S. Rudich, A. Sahai, S. P. Vadhan, and K. Yang, "On the (Im)possibility of Obfuscating Programs," in Proceedings of the 21st Annual International Cryptology Conference on Advances in Cryptology, pp.1-18, August 19-23, 2001.

[8] J. Bardin, J. Callas, S. Chaput, P. Fusco, F. Gilbert, C. Hoff, D. Hurst, S.

Kumaraswamy, L. Lynch, S. Matsumoto, B. O'Higgins, J. Pawluk, G. Reese, J. Reich, J. Ritter, J. Spivey, J. Viega, "Security guidance for critical areas of focus in cloud computing," [Online.] Cloud Security Alliance, Technical report, April 2009. Available: https://cloudsecurityalliance.org/csaguide.pdf. [Accessed: Mar 5, 2011].

[9] A. Boldyreva, N. Chenette, Y. Lee, and A. O'Neill, "Order-preserving symmetric encryption," in Proc. of Eurocrypt '09, vol. 5479 of LNCS. Springer, 2009.

[10] D. Boneh and B. Waters, "Conjunctive, Subset, and Range Queries on Encrypted Data," in Proceedings of the 4th Theory of Cryptography Conference (TCC '07). LNCS vol. 4392, pp. 535-554, Springer, Heidelberg (2007).

[11] J. Brodkin, "Gartner: Seven cloud-computing security risks", July 2, 2008. http://www.networkworld.com/news/2008/070208-cloud.html. [Accessed: Nov 18, 2010.]

[12] S. Bugiel, S. Nürnberger, A. Sadeghi, and T. Schneider, "Twin Clouds: An architecture for secure cloud computing (Extended Abstract)," in Workshop on Cryptography and Security in Clouds (WCSC '11), Zurich, March 15-16, 2011.

[13] R. Buyya, C. S. Yeo, S. Venugopal, J. Broberg, and I. Brandic, "Cloud

Computing and Emerging IT Platforms: Vision, Hype, and Reality for Delivering Computing as the 5th Utility" in Future Generation Computer Systems, Elsevier Science, Amsterdam, The Netherlands, 2009.

[doi: http://dx.doi.org/10.1016/j.future.2008.12.001]


[14] Y. C. Chang and M. Mitzenmacher, "Privacy preserving keyword searches

on remote encrypted data." in Applied Cryptography and Network Security Conference (ACNS '05). LNCS, vol. 3531, Springer, Heidelberg (2005).

[15] Y. Chen and R. Sion, "On securing untrusted clouds with cryptography," in Proceedings of the 9th annual ACM workshop on Privacy in the electronic society (WPES '10), pp. 109-114, 2010.

[16] B. Chor, O. Goldreich, E. Kushilevitz, and M. Sudan, "Private information retrieval," in 36th IEEE Conference on the Foundations of Computer Science, pp. 41–50. IEEE Computer Society Press, 1995.

[17] Cloud Security Alliance, "CSA Cloud Security Alliance Security Guidance for Critical Areas of Focus in Cloud Computing V2.1," December 2009. Available: https://cloudsecurityalliance.org/csaguide.pdf. [Accessed: Mar 8, 2011.]

[18] CNBC, "Amazon Failure Takes Down Sites Across Internet," April 21, 2011. http://www.cnbc.com/id/42706104/. [Accessed: April 25, 2011.]

[19] CSO Online, "Avoid Patriot Act Surprises: Encrypt Cloud Data on-Premise", July 20, 2011. http://www.cio.com.au/article/394098/. [Accessed: July 21, 2011.]

[20] R. Curtmola, J. A. Garay, S. Kamara and R. Ostrovsky, "Searchable symmetric encryption: Improved definitions and efficient constructions," in Conference on Computer and Communications Security (CCS '06), ACM Press, New York, 2006.

[21] M. van Dijk and A. Juels, "On the impossibility of cryptography alone for

privacy-preserving cloud computing," in Cryptology ePrint Archive, Report 2010/305, 2010. Available: http://eprint.iacr.org/2010/305.


[22] Forbes, "DARPA Will Spend $20 Million To Search For Crypto’s Holy Grail," April 6, 2011.

Available: http://blogs.forbes.com/andygreenberg/2011/04/06/darpa-will-spend-20-million-to-search-for-cryptos-holy-grail/. [Accessed: April 28, 2011.]

[23] Gartner, "Gartner Says Worldwide Cloud Services Market to Surpass $68 Billion in 2010," June 22, 2010. http://www.gartner.com/it/page.jsp?id=1389313. [Accessed: May 2, 2011.]

[24] C. Gentry, "Fully homomorphic encryption using ideal lattices," in Proceedings of the 41st ACM Symposium on Theory of Computing (STOC), pp. 169–178. ACM, New York, 2009.

[25] C. Gentry and S. Halevi, "Implementing Gentry's fully-homomorphic

encryption scheme," in Cryptology ePrint Archive, Report 2010/520, 2010. Available: http://eprint.iacr.org/2010/520.

[26] C. Gentry, "Fully Homomorphic Encryption without Bootstrapping," in

Cryptology ePrint Archive, Report 2011/277, 2011. Available: http://eprint.iacr.org/2011/277.

[27] P. Golle, J. Staddon, and B. Waters, "Secure conjunctive keyword search over encrypted data," in Applied Cryptography and Network Security Conference (ACNS '04). LNCS, vol. 3089, pp. 31-45. Springer-Verlag, 2004.

[28] V. Goyal, O. Pandey, A. Sahai and B. Waters, "Attribute-based encryption

for fine-grained access control of encrypted data," in Proceedings of the 13th ACM conference on Computer and communications security (CCS '06), pp. 89-98, 2006.

[29] IDC, "IT Cloud Services User Survey, pt.2: Top Benefits & Challenges," Oct 2, 2008. http://blogs.idc.com/ie/?p=210. [Accessed: May 16, 2011.]

[30] ImageMagick, ImageMagick 6.7.1-1, 2011. http://www.imagemagick.org.


[31] P. T. Jaeger, J. Lin, J. M. Grimes, and S. N. Simmons, "Where is the cloud? Geography, economics, environment, and jurisdiction in cloud computing," First Monday, vol. 14, no. 5, 2009.

http://firstmonday.org/htbin/cgiwrap/bin/ojs/index.php/fm/article/view/2456/. [Accessed: Jan 10, 2011.]

[32] S. Kamara, and K. Lauter, "Cryptographic cloud storage," in Proceedings of

Financial Cryptography: Workshop on Real-Life Cryptographic Protocols and Standardization, Tenerife, Canary Islands, Spain, 2010.

[33] L. M. Kaufman, "Data Security in the World of Cloud Computing," IEEE Security and Privacy, vol. 7, no. 4, pp. 61-64, 2009.

[34] J. P. Lewis, "Fast normalized cross-correlation," Vision Interface, pp. 120–123, 1995.

[35] A. Lewko, T. Okamoto, A. Sahai, K. Takashima, and B. Waters, "Fully secure functional encryption: Attribute-based encryption and (hierarchical) inner product encryption," in Advances in Cryptology-EUROCRYPT 2010, LNCS vol. 6110, pp. 62-91. Springer, Heidelberg (2010).

[36] Matlab Users Guide, The Math Works, Inc., Natick, MA. http://www.mathworks.com/help/techdoc/. [Accessed: March 20, 2011.]

[37] M. Li, S. Yu, N. Cao and W. Lou, "Authorized Private Keyword Search over Encrypted Data in Cloud Computing," in 31st International Conference on Distributed Computing Systems (ICDCS '11), pp. 383-392, 2011. [doi=10.1109/ICDCS.2011.55.]

[38] P. Mell and T. Grance, "Effectively and securely using the cloud computing

paradigm," NIST, Information Technology Laboratory, 2009. Available: http://csrc.nist.gov/groups/SNS/cloud-computing/cloud-computing-v26.ppt


[39] M. Mowbray, S. Pearson, and Y. Shen, “Enhancing privacy in cloud computing via policy-based obfuscation,” in Journal of Supercomputing, pp. 1-25, 2010.

[40] M. Naor and A. Shamir, "Visual cryptography," in Advances in Cryptology: EUROCRYPT '94, LNCS, vol. 950, pp. 1–12, Berlin, Springer-Verlag, 1995.

[41] A. Narayanan, and V. Shmatikov, "Obfuscated databases and group privacy," in ACM Conference on Computer and Communications Security, pp. 102–111. ACM Press, New York, 2005.

[42] Network Computing, "Encryption Is Cloud Computing Security Savior", Nov 19, 2009. http://www.networkcomputing.com/security/229502349. [Accessed: June 18, 2011.]

[43] Network World, "The U.S. Patriot Act has an impact on cloud security," Sep 29, 2009. http://www.networkworld.com/newsletters/2009/092909cloudsec1.html. [Accessed: April 19, 2011.]

[44] Oxford Dictionaries, "The OEC: Facts about the language". http://oxforddictionaries.com/page/oecfactslanguage/the-oec-facts-about-the-language. [Accessed: Dec 16, 2010.]

[45] B. Parno, "Bootstrapping trust in a "trusted" platform," in HOTSEC 2008: Proceedings of the 3rd conference on Hot topics in security, Berkeley, CA, USA, USENIX Association, pp. 1–6, 2008.

[46] PC World, "Google Docs Glitch Exposes Private Files", March 9, 2009. http://www.pcworld.com/article/160927/google_docs_glitch_exposes_private_files.html. [Accessed: Feb 12, 2011.]

[47] S. T. Peddinti and N. Saxena, "On the effectiveness of anonymizing networks for web search privacy," in Proceedings of the 6th ACM Symposium on Information, Computer and Communications Security (ASIACCS '11), pp. 483-489, 2011, ACM, New York, NY, USA.

[48] P. S. Revenkar, A. Anjum, W. Z. Gandhare, "Survey of Visual Cryptography Schemes," in International Journal of Security and Its Applications, vol. 4, no. 2, pp. 49-56, 2010.

[49] T. Ristenpart, E. Tromer, H. Shacham, and S. Savage, "Hey, you, get off of

my cloud: exploring information leakage in third-party compute clouds," in Proceedings of the 16th ACM conference on computer and communications security, pp. 199-212, Chicago, Illinois, USA, 2009.

[50] RSA, "The Role of Security in Trustworthy Cloud Computing," 2010. http://www.emc.com/collateral/about/investor-relations/9921_CLOUD_WP_0209_lowres.pdf. [Accessed: May 22, 2011.]

[51] E. Shi, J. Bethencourt, T-H. H. Chan, D. Song and A. Perrig, "Multi-

Dimensional Range Query over Encrypted Data," in Proceedings of the 2007 IEEE Symposium on Security and Privacy, pp. 350-364, 2007.

[52] S. J. Shyu, "Efficient visual secret sharing scheme for color images," Pattern Recognition, vol. 39, pp. 866–880, 2006.

[53] S. J. Shyu, S.-Y. Huang, Y.-K. Lee, R.-Z. Wang and K. Chen, "Sharing multiple secrets in visual cryptography," Pattern Recognition, vol. 40, pp. 3633–3651, 2007.

[54] Simple-Talk, "Transparent Data Encryption", March 16, 2010. http://www.simple-talk.com/sql/database-administration/transparent-data-encryption/. [Accessed: March 28, 2011.]

[55] D. Song, D. Wagner and A. Perrig, "Practical Techniques for Searches on Encrypted Data," in Proceedings of the 2000 IEEE Symposium on Security and Privacy, pp. 44-55, May 14-17, 2000.


[56] A. Swaminathan, Y. Mao, G.-M. Su, H. Gou, A. L. Varna, S. He, M.Wu, and D.W. Oard, “Confidentiality-preserving rank-ordered search,” in Proc. of the Workshop on Storage Security and Survivability, 2007.

[57] E. R. Verheul and H. C. A. van Tilborg, "Constructions and properties of k

out of n visual secret sharing schemes," in Designs, Codes and Cryptography, vol. 11, pp. 179-196, 1997.

[58] C. Wang, N. Cao, K. Ren, and W. Lou, "Enabling Secure and Efficient

Ranked Keyword Search over Outsourced Cloud Data," in IEEE Transactions on Parallel and Distributed Systems (TPDS), 2011. (To be published.)

[59] D. Wang, F. Yi and X. Li, "On general construction for extended visual cryptography schemes," Pattern Recognition, vol. 42, pp. 3071–3082, 2009.

[60] C. Wu and L. Chen, "A study on visual cryptography," Master's thesis, Institute of Computer and Information Science, National Chiao Tung University, Taiwan, R.O.C., 1998.

[61] S. Ye, F. Wu, R. Pandey and H. Chen, "Noise Injection for Search Privacy

Protection," in Proceedings of the 2009 International Conference on Computational Science and Engineering, pp. 1-8, 2009.

[62] S. Zerr, D. Olmedilla, W. Nejdl, and W. Siberski, “Zerber+r: Top-k retrieval

from a confidential index,” in Proc. of EDBT ’09, 2009.
